What is do_no_merge_key?
The do_not_merge_key:
option applies stitching rules that differ from the traditional Unification stitching rules:
-
If the values of the key specified in
do_not_merge_key:
are different, records must not be stitched together (even if other keys link them). - Records without this key's value can be stitched together using other keys.
When to Use do_not_merge_key
-
You have a unique key (e.g.,
member_id
,customer_id
) to ensure personal uniqueness. -
For records lacking this key, you want to stitch based on secondary keys (e.g.,
email
,tel
) wherever possible.
By setting do_not_merge_key: member_id
, records with distinct member_id
s remain separate. For records without member_id
, stitching proceeds using the available keys.
Let's deepen our understanding with specific examples.
Dataset
This page use a single, simple table for this example.
site_aaa
member_id | tel | name | |
---|---|---|---|
1 | a@ex.com | 1111 | Taka |
2 | a@ex.com | 2222 | Tatsuo |
3 | b@ex.com | 3333 | Naruse |
3 | b@ex.com | 4444 | Yuichiro |
NULL | c@ex.com | 5555 | Minero |
NULL | c@ex.com | 6666 | Kaz |
id_unification_ex4.dig
First, let's introduce the Workflow (WF) to be executed in this section.
+call_unification:
http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call
headers:
- authorization: ${secret:td.apikey}
method: POST
retry: true
content_format: json
content:
run_canonical_ids: true
run_enrichments: true
run_master_tables: true
full_refresh: true
keep_debug_tables: true
unification:
!include : unification_ex4.yml
unification_ex4.yml
name: test_id_unification_ex4
keys:
- name: member_id
- name: email
tables:
- database: test_id_unification_ex4
table: site_aaa
key_columns:
- {column: member_id, key: member_id}
- {column: email, key: email}
canonical_ids:
- name: person_id
merge_by_keys: [member_id, email]
merge_iterations: 3
do_not_merge_key: member_id
master_tables:
- name: master_table_ex4
canonical_id: person_id
attributes:
- name: member_id
source_columns:
- {table: site_aaa, priority: 1}
- name: email
source_columns:
- {table: site_aaa, priority: 1}
- name: tel
array_elements: 2
source_columns:
- {table: site_aaa, priority: 1}
- name: name
array_elements: 2
source_columns:
- {table: site_aaa, priority: 1}
Cases Where Default ID Unification Falls Short
For example, consider scenarios where the email used as a key includes shared addresses like mailing lists or family emails. In such cases, linking data based solely on email might mistakenly associate different individuals.
- However, removing email as a key would hinder proper stitching.
-
On the other hand, there may also be keys like
member_id
that ensure personal uniqueness.
Here’s a potential stitching approach you might consider:
-
Prioritize stitching based on the values of
member_id
wherever possible. -
If no
member_id
value exists, attempt stitching using email as a secondary key.
Let’s revisit the table below.
member_id | tel | name | |
---|---|---|---|
1 | a@ex.com | 1111 | Taka |
2 | a@ex.com | 2222 | Tatsuo |
3 | b@ex.com | 3333 | Naruse |
3 | b@ex.com | 4444 | Yuichiro |
NULL | c@ex.com | 5555 | Minero |
NULL | c@ex.com | 6666 | Kaz |
Given this data:
-
Rows 1 and 2 share the same email but different
member_id
s, so they shouldn’t be merged. -
Rows 3 and 4 share the same
member_id
and should be merged. -
Rows 5 and 6 lack
member_id
, so they should be merged based on email.
This ideal stitching behavior could be achieved using a specific unification option: do_not_merge_key
.
Without Using do_not_merge_key
Let’s consider a conventional approach without do_not_merge_key
by configuring canonical_ids
as shown below:
canonical_ids:
- name: person_id
merge_by_keys: [member_id, email]
merge_iterations: 3
The output (enriched_site_aaa
table) wouldn’t meet expectations. Rows 1 and 2, despite having different member_id
s, are treated as the same individual because they share the same email. This happens even though their shared email is merely a mailing list address.
member_id | tel | name | canonical_id | |
---|---|---|---|---|
1 | a@ex.com | 1111 | Taka | 4ydklKlyPnfa |
2 | a@ex.com | 2222 | Tatsuo | 4ydklKlyPnfa |
3 | b@ex.com | 3333 | Naruse | xqaWYjT4GR3a |
3 | b@ex.com | 4444 | Yuichiro | xqaWYjT4GR3a |
NULL | c@ex.com | 5555 | Minero | NEKDReELMAx |
NULL | c@ex.com | 6666 | Kaz | NEKDReELMAx |
How can we ensure rows 1 and 2 are treated as different individuals?
Using do_not_merge_key
Next, let’s configure canonical_ids
with the do_not_merge_key
option and execute the unification:
canonical_ids:
- name: person_id
merge_by_keys: [member_id, email]
merge_iterations: 3
do_not_merge_key: member_id
This results in the desired output where rows 1 and 2 are treated separately, while rows 5 and 6 are merged based on email.
member_id | tel | name | canonical_id | |
---|---|---|---|---|
1 | a@ex.com | 1111 | Taka | 4ydklKlyPnfa |
2 | a@ex.com | 2222 | Tatsuo | XNKI3XAY1Hja |
3 | b@ex.com | 3333 | Naruse | xqaWYjT4GR3a |
3 | b@ex.com | 4444 | Yuichiro | xqaWYjT4GR3a |
NULL | c@ex.com | 5555 | Minero | NEKDReELMAx |
NULL | c@ex.com | 6666 | Kaz | NEKDReELMAx |
The master_table
output is as follows:
person_id | member_id | tel | name | |
---|---|---|---|---|
58MMqWNFdAhu | 1 | a@ex.com | [1111] | ["Taka"] |
WDbg4Lovngdu | 2 | a@ex.com | [2222] | ["Tatsuo"] |
wkL-X_7PU2Ju | 3 | b@ex.com | [4444, 3333] | ["Yuichiro", "Naruse"] |
UVs_qHcWwZnz | NULL | c@ex.com | [6666, 5555] | ["Kaz", "Minero"] |
Explanation of Unification Algorithm
The Unification Algorithm for this case is explained.
graph_unify_loop_0 (1,2,3,graph also)
In fact, the initial graph is the final state this time.
In the conventional Algorithm, "leader_id: 2" should be replaced (merged) by "leader_id: 1" , but do_not_merge_key:
is set to member_id, so that it is not merged. So, no further changes are made.
Generation and assignment of canonical_id
In this case, four different canonical_ids
will be generated.
The difference is that "follower: a@ex.com" points to two leaders, "member_id: 1" and "member_id: 2," regardless of the state of convergence.
In fact, there are 2 rows of records for a@ex.com in the lookup table.
id | canonical_id |
---|---|
1 | 4ydklKlyPnfa |
2 | XNKI3XAY1Hja |
3 | xqaWYjT4GR3a |
a@ex.com | 4ydklKlyPnfa |
a@ex.com | XNKI3XAY1Hja |
b@ex.com | xqaWYjT4GR3a |
c@ex.com | NEKDReELMAxH |
This means that the algorithm has converged without "all the followers point to only one leader," which is one of the methods to determine algorithm convergence. However, the other method, "checking the output of the +report_diff
task in WF Logs" is still valid.
Cases where do_not_merge_key should be used
Finally, let's review the cases in which do_not_merge_key should be used. 1.
- (Assumption) You have one key whose uniqueness is guaranteed (e.g. member _ id, customer _ id, etc.).
- for records that do not have the value of key 1., we want to proceed with stitching as much as possible, using a key whose uniqueness is not partially guaranteed (e.g., email, tel, etc.).
In this case, if you set do_not_merge_key: member_id
, the records will be judged as different persons even if other keys are stitched with the same value, as long as member_id is different. However, for records that do not have member_id, other keys are stitched together to the maximum extent possible.