Fuzzy Matching

Fuzzy matching is the process of taking records that are not an exact match and using probabilistic matching to attribute them to the correct ID. Fuzzy matching is most commonly used as part of a set of deduplication rules.

This post covers three methods of fuzzy matching in SQL:

  • RLIKE — SQL function that uses regular expressions to do text based matching.
  • Levenshtein — a Hive function that calculates similarity between two strings
  • SOUNDEX — function that matches based on how similar two strings sound.

These methods are often best used together.

RLIKE Regular Expression Matching

The RLIKE function is similar to the LIKE function, but it allows use of regular expressions (regex). This makes it useful to detect common misspellings, typos, or names with multiple permutations.

In the following example a regex is used to find austin and austen as being likely the same person.

Copy
Copied
SELECT DISTINCT firstname FROM lead WHERE LOWER(firstname) RLIKE 'aust[ie]n$'

fuzzy RLIKE

LEVENSHTEIN Algorithm

The Levenshtein algorithm calculates the distance between two strings using a minimum number of insert, delete, and substitutions needed to get from one string to the other. This algorithm is provided in Hive as the handy LEVENSHTEIN(string,list) function. The levenshtein distance between the two strings is returned.

Copy
Copied
LEVENSHTEIN('user@td.com', 'user@td.com') // returns 0, as they are identicle
LEVENSHTEIN('user@td.com', 'usr@td.com' ) // returns 1, as there is 1 subtraction difference

In the following example the mk_email table is used to find similarities to the string user@treasure-data.com. The result will be returned as a list of the Leveshtein distance calculated and the input email being compared against.

Copy
Copied
SELECT LEVENSHTEIN('user@td.com', mk_email) as lev_dist, mk_email, 'user@td.com'
  input FROM lead WHERE mk_email IS NOT NULL ORDER BY lev_dist LIMIT 100

fuzzy LEVENSHTEIN

The runtime complexity of the Levenshtein algorithm depends on the size of the input (length of the strings) as well as the number of rows.

SOUNDEX Phonemicized Similarity

The soundex function returns the similarity code of the input string.

Copy
Copied
SELECT SOUNDEX('TreasureData')
==>
T626

The soundex function is best used in conjunction with other functions, like the Levenshtein function to find similarities in both text and sound.

Copy
Copied
SELECT  SOUNDEX('michael') AS michael_soundex, 
        SOUNDEX('michele') AS michele_soundex, 
        SOUNDEX('mitchel') AS mitchel_soundex, 
        LEVENSHTEIN('mitchel','michele') AS leven_dist1, 
        LEVENSHTEIN('michael','michele') AS leven_dist2

Note that in this example although both mitchel and michele have a Levenshtein distance of two from michael, michele is definitely closer to michael than mitchel.

Further Reading

For more resources on Fuzzy Matching check out the following resources.

Latest from our blog

Upcoming Evolution of Treasure Data Query Engines

Journey to Containers in Core Services Worker Platform

Sharing our journey with container technology from Worker Platform that manages the fair scheduling of jobs and helps manage the state of individual jobs.

Automatic Customer Segmentation with Machine Learning

Making Your Auto-Segmentation Model Work For You