Fuzzy Matching

Fuzzy matching is the process of taking records that are not an exact match and using probabilistic matching to attribute them to the correct ID. Fuzzy matching is most commonly used as part of a set of deduplication rules.

This post covers three methods of fuzzy matching in SQL:

RLIKE — SQL function that uses regular expressions to do text based matching.
Levenshtein — a Hive function that calculates similarity between two strings
SOUNDEX — function that matches based on how similar two strings sound.

These methods are often best used together.

`RLIKE` Regular Expression Matching

The RLIKE function is similar to the LIKE function, but it allows use of regular expressions (regex). This makes it useful to detect common misspellings, typos, or names with multiple permutations.

In the following example a regex is used to find austin and austen as being likely the same person.

Copy

Copied

SELECT DISTINCT firstname FROM lead WHERE LOWER(firstname) RLIKE 'aust[ie]n$'

fuzzy RLIKE

`LEVENSHTEIN` Algorithm

The Levenshtein algorithm calculates the distance between two strings using a minimum number of insert, delete, and substitutions needed to get from one string to the other. This algorithm is provided in Hive as the handy LEVENSHTEIN(string,list) function. The levenshtein distance between the two strings is returned.

Copy

Copied

LEVENSHTEIN('user@td.com', 'user@td.com') // returns 0, as they are identicle
LEVENSHTEIN('user@td.com', 'usr@td.com' ) // returns 1, as there is 1 subtraction difference

In the following example the mk_email table is used to find similarities to the string user@treasure-data.com. The result will be returned as a list of the Leveshtein distance calculated and the input email being compared against.

Copy

Copied

SELECT LEVENSHTEIN('user@td.com', mk_email) as lev_dist, mk_email, 'user@td.com'
  input FROM lead WHERE mk_email IS NOT NULL ORDER BY lev_dist LIMIT 100

fuzzy LEVENSHTEIN

The runtime complexity of the Levenshtein algorithm depends on the size of the input (length of the strings) as well as the number of rows.

`SOUNDEX` Phonemicized Similarity

The soundex function returns the similarity code of the input string.

Copy

Copied

SELECT SOUNDEX('TreasureData')
==>
T626

The soundex function is best used in conjunction with other functions, like the Levenshtein function to find similarities in both text and sound.

Copy

Copied

SELECT  SOUNDEX('michael') AS michael_soundex, 
        SOUNDEX('michele') AS michele_soundex, 
        SOUNDEX('mitchel') AS mitchel_soundex, 
        LEVENSHTEIN('mitchel','michele') AS leven_dist1, 
        LEVENSHTEIN('michael','michele') AS leven_dist2

Note that in this example although both mitchel and michele have a Levenshtein distance of two from michael, michele is definitely closer to michael than mitchel.

Fuzzy Matching

`RLIKE` Regular Expression Matching

`LEVENSHTEIN` Algorithm

`SOUNDEX` Phonemicized Similarity

Further Reading

Latest from our blog

Leveraging Query ReExecution for Smooth Hive 4 Migration

Orchestrate dbt with Treasure Workflow Episode 2

Upcoming Evolution of Treasure Data Query Engines

Fuzzy Matching

RLIKE Regular Expression Matching

LEVENSHTEIN Algorithm

SOUNDEX Phonemicized Similarity

Further Reading

Latest from our blog

Leveraging Query ReExecution for Smooth Hive 4 Migration

Orchestrate dbt with Treasure Workflow Episode 2

Upcoming Evolution of Treasure Data Query Engines

`RLIKE` Regular Expression Matching

`LEVENSHTEIN` Algorithm

`SOUNDEX` Phonemicized Similarity