Fuzzy matching is the process of taking records that are not an exact match and using probabilistic matching to attribute them to the correct ID. Fuzzy matching is most commonly used as part of a set of deduplication rules.
This post covers three methods of fuzzy matching in SQL:
-
RLIKE
— SQL function that uses regular expressions to do text based matching. - Levenshtein — a Hive function that calculates similarity between two strings
-
SOUNDEX
— function that matches based on how similar two strings sound.
These methods are often best used together.
RLIKE
Regular Expression Matching
The RLIKE
function is similar to the LIKE
function, but it allows use of regular expressions (regex). This makes it useful to detect common misspellings, typos, or names with multiple permutations.
In the following example a regex is used to find austin
and austen
as being likely the same person.
SELECT DISTINCT firstname FROM lead WHERE LOWER(firstname) RLIKE 'aust[ie]n$'
LEVENSHTEIN
Algorithm
The Levenshtein algorithm calculates the distance between two strings using a minimum number of insert, delete, and substitutions needed to get from one string to the other. This algorithm is provided in Hive as the handy LEVENSHTEIN(string,list)
function. The levenshtein distance between the two strings is returned.
LEVENSHTEIN('user@td.com', 'user@td.com') // returns 0, as they are identicle
LEVENSHTEIN('user@td.com', 'usr@td.com' ) // returns 1, as there is 1 subtraction difference
In the following example the mk_email
table is used to find similarities to the string user@treasure-data.com
. The result will be returned as a list of the Leveshtein distance calculated and the input email being compared against.
SELECT LEVENSHTEIN('user@td.com', mk_email) as lev_dist, mk_email, 'user@td.com'
input FROM lead WHERE mk_email IS NOT NULL ORDER BY lev_dist LIMIT 100
The runtime complexity of the Levenshtein algorithm depends on the size of the input (length of the strings) as well as the number of rows.
SOUNDEX
Phonemicized Similarity
The soundex function returns the similarity code of the input string.
SELECT SOUNDEX('TreasureData')
==>
T626
The soundex function is best used in conjunction with other functions, like the Levenshtein function to find similarities in both text and sound.
SELECT SOUNDEX('michael') AS michael_soundex,
SOUNDEX('michele') AS michele_soundex,
SOUNDEX('mitchel') AS mitchel_soundex,
LEVENSHTEIN('mitchel','michele') AS leven_dist1,
LEVENSHTEIN('michael','michele') AS leven_dist2
Note that in this example although both mitchel and michele have a Levenshtein distance of two from michael, michele is definitely closer to michael than mitchel.
Further Reading
For more resources on Fuzzy Matching check out the following resources.
- Fuzzy Matching Workflow - step by step implimentation guide with DigDag workflows and sample data.