Learning canonical_id
for Cookie ID and Email
This page is based on the previous page Learning canonical_id
for Cookie ID, so some duplicate explanations have been omitted; please make sure you have read through the previous page first before continuing.
This page deals with a case where the same person is identified by stitching together the td_client_id
, td_global_id
, td_ssc_id
and email of multiple sites. There are four sites, but each table has a different combination of keys that it has.
site_aaa | site_aaa | site_aaa | site_xxx | site_xxx | site_yyy | site_yyy | site_zzz | site_zzz | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
日付 | td_client_id | td_global_id | td_ssc_id | td_ssc_id | td_global_id | td_ssc_id | td_client_id | ||||||
1月 | 5日 | aaa_001 | 3rd_001 | ssc_001 | a@ex.com | ssc_001 | |||||||
15日 | aaa_001 | 3rd_002 | ssc_001 | zzz_001 | a@ex.com | ||||||||
25日 | aaa_001 | 3rd_003 | ssc_001 | ||||||||||
2月 | 5日 | aaa_001 | 3rd_004 | ssc_001 | ssc_001 | 3rd_004 | |||||||
15日 | ssc_001 | 3rd_005 | a@ex.com | ssc_001 | |||||||||
25日 | a@ex.com | ssc_001 | zzz_003 | a@ex.com | |||||||||
3月 | 5日 | zzz_003 | a@ex.com | ||||||||||
15日 | ssc_001 | 3rd_008 | zzz_003 | a@ex.com | |||||||||
25日 | aaa_002 | 3rd_009 | ssc_002 | ssc_001 | 3rd_009 | ||||||||
4月 | 5日 | aaa_002 | 3rd_010 | ssc_002 | b@ex.com | ssc_003 | |||||||
15日 | b@ex.com | ssc_003 | zzz_004 | b@ex.com | |||||||||
25日 | ssc_002 | 3rd_010 | zzz_004 | c@ex.com | |||||||||
5月 | 5日 | aaa_003 | 3rd_013 | ssc_003 | ssc_002 | 3rd_013 | |||||||
15日 | aaa_003 | 3rd_014 | ssc_003 | ||||||||||
25日 | aaa_003 | 3rd_015 | ssc_004 | c@ex.com | ssc_003 | zzz_005 | c@ex.com | ||||||
6月 | 5日 | aaa_003 | 3rd_016 | ssc_004 | ssc_003 | 3rd_016 | |||||||
15日 | ssc_003 | 3rd_017 | zzz_005 | c@ex.com | |||||||||
25日 | c@ex.com | ssc_004 | zzz_005 | c@ex.com |
Data Preparation
The above data is assumed to be stored in TD tables with the names site_aaa, site_xxx, site_yyy and site_zzz for each site, as in the site_aaa table below. This table can be generated by using Workflow Samples.
site_aaa
time | site | td_client_id | td_global_id | td_ssc_id |
---|---|---|---|---|
2023/01/05 | aaa.jp | aaa_001 | 3rd_001 | ssc_001 |
2023/01/15 | aaa.jp | aaa_001 | 3rd_002 | ssc_001 |
2023/01/25 | aaa.jp | aaa_001 | 3rd_003 | ssc_001 |
2023/02/05 | aaa.jp | aaa_001 | 3rd_004 | ssc_001 |
2023/02/15 | aaa.jp | |||
2023/02/25 | aaa.jp | |||
2023/03/05 | aaa.jp | |||
2023/03/15 | aaa.jp | |||
2023/03/25 | aaa.jp | aaa_002 | 3rd_009 | ssc_002 |
2023/04/05 | aaa.jp | aaa_002 | 3rd_010 | ssc_002 |
2023/04/15 | aaa.jp | |||
2023/04/25 | aaa.jp | |||
2023/05/05 | aaa.jp | aaa_003 | 3rd_013 | ssc_003 |
2023/05/15 | aaa.jp | aaa_003 | 3rd_014 | ssc_003 |
2023/05/25 | aaa.jp | aaa_003 | 3rd_015 | ssc_004 |
2023/06/05 | aaa.jp | aaa_003 | 3rd_016 | ssc_004 |
2023/06/15 | aaa.jp | |||
2023/06/25 | aaa.jp |
site_xxx
time | site | td_ssc_id | td_global_id |
---|---|---|---|
2023/01/05 | xxx.jp | NULL | NULL |
2023/01/15 | xxx.jp | NULL | NULL |
2023/01/25 | xxx.jp | NULL | NULL |
2023/02/05 | xxx.jp | ssc_001 | 3rd_004 |
2023/02/15 | xxx.jp | ssc_001 | 3rd_005 |
2023/02/25 | xxx.jp | NULL | NULL |
2023/03/05 | xxx.jp | NULL | NULL |
2023/03/15 | xxx.jp | ssc_001 | 3rd_008 |
2023/03/25 | xxx.jp | ssc_001 | 3rd_009 |
2023/04/05 | xxx.jp | NULL | NULL |
2023/04/15 | xxx.jp | NULL | NULL |
2023/04/25 | xxx.jp | ssc_002 | 3rd_010 |
2023/05/05 | xxx.jp | ssc_002 | 3rd_013 |
2023/05/15 | xxx.jp | NULL | NULL |
2023/05/25 | xxx.jp | NULL | NULL |
2023/06/05 | xxx.jp | ssc_003 | 3rd_016 |
2023/06/15 | xxx.jp | ssc_003 | 3rd_017 |
2023/06/25 | xxx.jp | NULL | NULL |
site_yyy
time | site | td_ssc_id | |
---|---|---|---|
2023/01/05 | yyy.jp | a@ex.com | ssc_001 |
2023/01/15 | yyy.jp | NULL | NULL |
2023/01/25 | yyy.jp | NULL | NULL |
2023/02/05 | yyy.jp | NULL | NULL |
2023/02/15 | yyy.jp | a@ex.com | ssc_001 |
2023/02/25 | yyy.jp | a@ex.com | ssc_001 |
2023/03/05 | yyy.jp | NULL | NULL |
2023/03/15 | yyy.jp | NULL | NULL |
2023/03/25 | yyy.jp | NULL | NULL |
2023/04/05 | yyy.jp | b@ex.com | ssc_003 |
2023/04/15 | yyy.jp | b@ex.com | ssc_003 |
2023/04/25 | yyy.jp | NULL | NULL |
2023/05/05 | yyy.jp | NULL | NULL |
2023/05/15 | yyy.jp | NULL | NULL |
2023/05/25 | yyy.jp | c@ex.com | ssc_003 |
2023/06/05 | yyy.jp | NULL | NULL |
2023/06/15 | yyy.jp | NULL | NULL |
2023/06/25 | yyy.jp | c@ex.com | ssc_004 |
site_zzz
time | site | td_client_id | |
---|---|---|---|
2023/01/05 | zzz.jp | NULL | NULL |
2023/01/15 | zzz.jp | zzz_001 | a@ex.com |
2023/01/25 | zzz.jp | NULL | NULL |
2023/02/05 | zzz.jp | NULL | NULL |
2023/02/15 | zzz.jp | NULL | NULL |
2023/02/25 | zzz.jp | zzz_003 | a@ex.com |
2023/03/05 | zzz.jp | zzz_003 | a@ex.com |
2023/03/15 | zzz.jp | zzz_003 | a@ex.com |
2023/03/25 | zzz.jp | NULL | NULL |
2023/04/05 | zzz.jp | NULL | NULL |
2023/04/15 | zzz.jp | zzz_004 | b@ex.com |
2023/04/25 | zzz.jp | zzz_004 | c@ex.com |
2023/05/05 | zzz.jp | NULL | NULL |
2023/05/15 | zzz.jp | NULL | NULL |
2023/05/25 | zzz.jp | zzz_005 | c@ex.com |
2023/06/05 | zzz.jp | NULL | NULL |
2023/06/15 | zzz.jp | zzz_005 | c@ex.com |
2023/06/25 | zzz.jp | zzz_005 | c@ex.com |
id_unification_ex2.dig
+call_unification:
http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call
headers:
- authorization: ${secret:td.apikey}
method: POST
retry: true
content_format: json
content:
run_canonical_ids: true
run_enrichments: true
run_master_tables: true
full_refresh: true
keep_debug_tables: true
unification:
!include : unification_ex2.yml
unification_ex2.yml
name: test_id_unification_ex2
keys:
- name: td_client_id
- name: td_global_id
- name: td_ssc_id
- name: email
tables:
- database: test_id_unification_ex2
table: site_aaa
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
- {column: td_ssc_id, key: td_ssc_id}
- database: test_id_unification_ex2
table: site_xxx
key_columns:
- {column: td_ssc_id, key: td_ssc_id}
- {column: td_global_id, key: td_global_id}
- database: test_id_unification_ex2
table: site_yyy
key_columns:
- {column: email, key: email}
- {column: td_ssc_id, key: td_ssc_id}
- database: test_id_unification_ex2
table: site_zzz
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: email, key: email}
canonical_ids:
- name: person_id
merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]
merge_iterations: 5
master_tables:
- name: master_table_ex2
canonical_id: person_id
attributes:
- name: email
source_columns:
- {table: site_yyy, order: last, order_by: time, priority: 1}
- {table: site_zzz, order: last, order_by: time, priority: 1}
- name: td_ssc_id
array_elements: 5
source_columns:
- {table: site_xxx, order: last, order_by: time, priority: 1}
- {table: site_yyy, order: last, order_by: time, priority: 2}
- name: td_client_id
array_elements: 5
source_columns:
- {table: site_aaa, order: last, order_by: time, priority: 1}
- {table: site_zzz, order: last, order_by: time, priority: 4}
- name: td_global_id
array_elements: 5
source_columns:
- {table: site_aaa, order: last, order_by: time, priority: 1}
- {table: site_xxx, order: last, order_by: time, priority: 2}
Explanation of the Unification Algorithm
graph_unify_loop_0
The initial state of the graph is as follows:
How the graph_unify_loop_0 Table Is Created
The graph_unify_loop_0
table is created within the +extract_and_merge
task of the workflow. Let's examine how the graph is constructed.
canonical_ids:
- name: person_id
merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]
merge_iterations: 5
In this configuration of canonical_ids:
, merge_by_keys:
prioritizes email
to ensure the creation of the most universal canonical_id
, followed by td_ssc_id
.
Using the original data, the graph_unify_loop_0
table is generated through the same steps as in the previous section.
Unlike before, pairs for site_aaa
containing three or more keys in a single table are created. In this case, for a single record, the leader and followers are as follows:
-
Leader:
-
td_ssc_id
-
-
Followers:
-
td_client_id
-
td_global_id
-
td_ssc_id
-
As a result, the record expands into three pairs for each follower (highlighted in green in the figure above).
follower_ns
and leader_ns
identify which key corresponds to each id. (In this example, 1 is td_client_id
, 2 is td_global_id
, 3 is td_ssc_id
, and 4 is email
.) In this configuration, the priority order of ns
is [4,3,1,2].
graph_unify_loop_1
Based on the graph from loop_0, leaders are updated according to the following rules:
The key difference this time is:
- If a leader is connected to a higher-priority leader, it will be replaced before leaders of the same priority adjacent to the follower.
All leaders are replaced with new_leader
(unless they are the minimum or maximum priority, in which case they remain unchanged). The table resulting from merging all these replacements is the graph_unify_loop_1
table.
Subsequent loops follow the same process with the graph from the previous iteration.
graph_unify_loop_2
graph_unify_loop_3
The process converges in the third loop.
Generating and Assigning canonical_id
In this example, the order of keys specified in merge_by_keys:
prioritizes email
as the highest. This is because, unlike other keys, the value of email
is less likely to change over time. For users without an email, the td_ssc_id
becomes the final leader, and the canonical_id
is generated based on this value.
canonical_id
Configurations for master_table
The canonical_id
used in the master_table
is generally created with all keys specified in merge_by_keys:
. However, canonical_id
can also be created using a subset of the keys.
That said, the canonical_id
configured in the master_table
must meet the following condition:
-
The configuration must ensure that a
canonical_id
is assigned to every record in all source tables. In other words, every source table must have at least one key included inmerge_by_keys:
.
For this example (Example 2 data), the keys in each table are as follows:
-
site_aaa
-
td_client_id
-
td_global_id
-
td_ssc_id
-
-
site_xxx
-
td_ssc_id
-
td_global_id
-
-
site_yyy
-
email
-
td_ssc_id
-
-
site_zzz
-
td_client_id
-
email
-
Given this, the merge_by_keys:
configurations valid for the master_table
are:
-
Good:
[email, td_ssc_id, td_global_id, td_client_id]
-
Good:
[td_ssc_id, td_global_id, td_client_id]
-
Good:
[email, td_global_id]
-
Good:
[td_ssc_id, td_client_id]
On the other hand, the following configurations will result in errors:
-
Bad:
[td_global_id, td_client_id]
-
Missing keys from the
site_yyy
table.
-
Missing keys from the
-
Bad:
[email, td_client_id]
-
Missing keys from the
site_xxx
table.
-
Missing keys from the
In such cases, an error like the following will appear:
400 Bad Request: {"master_tables[0].attributes[0].source_columns[0].table":["is invalid because it doesn't have columns merged into this master_table's canonical_id 'person_id'"],"master_tables[0].attributes[1].sou...
For example, in the [td_global_id, td_client_id]
case, canonical_id
is only assigned to records generated using td_global_id
and td_client_id
. Since the site_yyy
table does not include these keys, it cannot be assigned a canonical_id
. At least one key from merge_by_keys:
(used for stitching) must be present to ensure that every record in all tables is assigned a canonical_id
.