Fuzzy matching is the process of taking records that are not an exact match and using probabilistic matching to attribute them to the correct ID. Fuzzy matching is most commonly used as part of a set of deduplication rules. This post covers three methods of fuzzy matching in SQL: - `RLIKE` — SQL function that uses regular expressions to do text based matching. - Levenshtein — a Hive function that calculates similarity between two strings - `SOUNDEX` — function that matches based on how similar two strings sound. These methods are often best used together. ## `RLIKE` Regular Expression Matching The `RLIKE` function is similar to the `LIKE` function, but it allows use of regular expressions (regex). This makes it useful to detect common misspellings, typos, or names with multiple permutations. In the following example a regex is used to find `austin` and `austen` as being likely the same person. ```sql SELECT DISTINCT firstname FROM lead WHERE LOWER(firstname) RLIKE 'aust[ie]n$' ``` ![](/assets/fuzzy_rlike.9027b0f05f551d708ad889060b13a96f732756ce1cc13c893c507c4c3b1f7ef3.bc8e4c32.jpg) ## `LEVENSHTEIN` Algorithm The Levenshtein algorithm calculates the distance between two strings using a minimum number of insert, delete, and substitutions needed to get from one string to the other. This algorithm is provided in Hive as the handy `LEVENSHTEIN(string,list)` function. The levenshtein distance between the two strings is returned. ```sql LEVENSHTEIN('user@td.com', 'user@td.com') // returns 0, as they are identicle LEVENSHTEIN('user@td.com', 'usr@td.com' ) // returns 1, as there is 1 subtraction difference ``` In the following example the `mk_email` table is used to find similarities to the string `user@treasure-data.com`. The result will be returned as a list of the Leveshtein distance calculated and the input email being compared against. ```sql SELECT LEVENSHTEIN('user@td.com', mk_email) as lev_dist, mk_email, 'user@td.com' input FROM lead WHERE mk_email IS NOT NULL ORDER BY lev_dist LIMIT 100 ``` ![](/assets/fuzzy_levenshtein.f5a1135e15cd5fb0891667a9ee375a468202b32b44cc1ab372623da1093c3fc9.bc8e4c32.jpg) The runtime complexity of the Levenshtein algorithm depends on the size of the input (length of the strings) as well as the number of rows. ## `SOUNDEX` Phonemicized Similarity The soundex function returns the similarity code of the input string. ```sql SELECT SOUNDEX('TreasureData') ==> T626 ``` The soundex function is best used in conjunction with other functions, like the Levenshtein function to find similarities in both text and sound. ```sql SELECT SOUNDEX('michael') AS michael_soundex, SOUNDEX('michele') AS michele_soundex, SOUNDEX('mitchel') AS mitchel_soundex, LEVENSHTEIN('mitchel','michele') AS leven_dist1, LEVENSHTEIN('michael','michele') AS leven_dist2 ``` Note that in this example although both mitchel and michele have a Levenshtein distance of two from michael, michele is definitely *closer* to michael than mitchel. ## Further Reading For more resources on Fuzzy Matching check out the following resources. - [Fuzzy Matching Workflow](https://github.com/treasure-data/treasure-boxes/tree/master/analytics-box/fuzzy-matching) - step by step implimentation guide with DigDag workflows and sample data.