Learning canonical_id for Cookie ID and Email

This page is based on the previous page Learning canonical_id for Cookie ID, so some duplicate explanations have been omitted; please make sure you have read through the previous page first before continuing.

This page deals with a case where the same person is identified by stitching together the td_client_id, td_global_id, td_ssc_id and email of multiple sites. There are four sites, but each table has a different combination of keys that it has.

site_aaa site_aaa site_aaa site_xxx site_xxx site_yyy site_yyy site_zzz site_zzz
日付 td_client_id td_global_id td_ssc_id td_ssc_id td_global_id email td_ssc_id td_client_id email
1月 5日 aaa_001 3rd_001 ssc_001 a@ex.com ssc_001
15日 aaa_001 3rd_002 ssc_001 zzz_001 a@ex.com
25日 aaa_001 3rd_003 ssc_001
2月 5日 aaa_001 3rd_004 ssc_001 ssc_001 3rd_004
15日 ssc_001 3rd_005 a@ex.com ssc_001
25日 a@ex.com ssc_001 zzz_003 a@ex.com
3月 5日 zzz_003 a@ex.com
15日 ssc_001 3rd_008 zzz_003 a@ex.com
25日 aaa_002 3rd_009 ssc_002 ssc_001 3rd_009
4月 5日 aaa_002 3rd_010 ssc_002 b@ex.com ssc_003
15日 b@ex.com ssc_003 zzz_004 b@ex.com
25日 ssc_002 3rd_010 zzz_004 c@ex.com
5月 5日 aaa_003 3rd_013 ssc_003 ssc_002 3rd_013
15日 aaa_003 3rd_014 ssc_003
25日 aaa_003 3rd_015 ssc_004 c@ex.com ssc_003 zzz_005 c@ex.com
6月 5日 aaa_003 3rd_016 ssc_004 ssc_003 3rd_016
15日 ssc_003 3rd_017 zzz_005 c@ex.com
25日 c@ex.com ssc_004 zzz_005 c@ex.com

Data Preparation

The above data is assumed to be stored in TD tables with the names site_aaa, site_xxx, site_yyy and site_zzz for each site, as in the site_aaa table below. This table can be generated by using Workflow Samples.

site_aaa

time site td_client_id td_global_id td_ssc_id
2023/01/05 aaa.jp aaa_001 3rd_001 ssc_001
2023/01/15 aaa.jp aaa_001 3rd_002 ssc_001
2023/01/25 aaa.jp aaa_001 3rd_003 ssc_001
2023/02/05 aaa.jp aaa_001 3rd_004 ssc_001
2023/02/15 aaa.jp
2023/02/25 aaa.jp
2023/03/05 aaa.jp
2023/03/15 aaa.jp
2023/03/25 aaa.jp aaa_002 3rd_009 ssc_002
2023/04/05 aaa.jp aaa_002 3rd_010 ssc_002
2023/04/15 aaa.jp
2023/04/25 aaa.jp
2023/05/05 aaa.jp aaa_003 3rd_013 ssc_003
2023/05/15 aaa.jp aaa_003 3rd_014 ssc_003
2023/05/25 aaa.jp aaa_003 3rd_015 ssc_004
2023/06/05 aaa.jp aaa_003 3rd_016 ssc_004
2023/06/15 aaa.jp
2023/06/25 aaa.jp

site_xxx

time site td_ssc_id td_global_id
2023/01/05 xxx.jp NULL NULL
2023/01/15 xxx.jp NULL NULL
2023/01/25 xxx.jp NULL NULL
2023/02/05 xxx.jp ssc_001 3rd_004
2023/02/15 xxx.jp ssc_001 3rd_005
2023/02/25 xxx.jp NULL NULL
2023/03/05 xxx.jp NULL NULL
2023/03/15 xxx.jp ssc_001 3rd_008
2023/03/25 xxx.jp ssc_001 3rd_009
2023/04/05 xxx.jp NULL NULL
2023/04/15 xxx.jp NULL NULL
2023/04/25 xxx.jp ssc_002 3rd_010
2023/05/05 xxx.jp ssc_002 3rd_013
2023/05/15 xxx.jp NULL NULL
2023/05/25 xxx.jp NULL NULL
2023/06/05 xxx.jp ssc_003 3rd_016
2023/06/15 xxx.jp ssc_003 3rd_017
2023/06/25 xxx.jp NULL NULL

site_yyy

time site email td_ssc_id
2023/01/05 yyy.jp a@ex.com ssc_001
2023/01/15 yyy.jp NULL NULL
2023/01/25 yyy.jp NULL NULL
2023/02/05 yyy.jp NULL NULL
2023/02/15 yyy.jp a@ex.com ssc_001
2023/02/25 yyy.jp a@ex.com ssc_001
2023/03/05 yyy.jp NULL NULL
2023/03/15 yyy.jp NULL NULL
2023/03/25 yyy.jp NULL NULL
2023/04/05 yyy.jp b@ex.com ssc_003
2023/04/15 yyy.jp b@ex.com ssc_003
2023/04/25 yyy.jp NULL NULL
2023/05/05 yyy.jp NULL NULL
2023/05/15 yyy.jp NULL NULL
2023/05/25 yyy.jp c@ex.com ssc_003
2023/06/05 yyy.jp NULL NULL
2023/06/15 yyy.jp NULL NULL
2023/06/25 yyy.jp c@ex.com ssc_004

site_zzz

time site td_client_id email
2023/01/05 zzz.jp NULL NULL
2023/01/15 zzz.jp zzz_001 a@ex.com
2023/01/25 zzz.jp NULL NULL
2023/02/05 zzz.jp NULL NULL
2023/02/15 zzz.jp NULL NULL
2023/02/25 zzz.jp zzz_003 a@ex.com
2023/03/05 zzz.jp zzz_003 a@ex.com
2023/03/15 zzz.jp zzz_003 a@ex.com
2023/03/25 zzz.jp NULL NULL
2023/04/05 zzz.jp NULL NULL
2023/04/15 zzz.jp zzz_004 b@ex.com
2023/04/25 zzz.jp zzz_004 c@ex.com
2023/05/05 zzz.jp NULL NULL
2023/05/15 zzz.jp NULL NULL
2023/05/25 zzz.jp zzz_005 c@ex.com
2023/06/05 zzz.jp NULL NULL
2023/06/15 zzz.jp zzz_005 c@ex.com
2023/06/25 zzz.jp zzz_005 c@ex.com

id_unification_ex2.dig

Copy
Copied
+call_unification:
  http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call  
  headers:
    - authorization: ${secret:td.apikey}
  method: POST
  retry: true
  content_format: json
  content:

    run_canonical_ids: true
    run_enrichments: true
    run_master_tables: true

    full_refresh: true
    keep_debug_tables: true

    unification:
      !include : unification_ex2.yml

unification_ex2.yml

Copy
Copied
name: test_id_unification_ex2

keys:
  - name: td_client_id
  - name: td_global_id
  - name: td_ssc_id
  - name: email

tables:
  - database: test_id_unification_ex2
    table: site_aaa
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}
      - {column: td_ssc_id, key: td_ssc_id}

  - database: test_id_unification_ex2
    table: site_xxx
    key_columns:
      - {column: td_ssc_id, key: td_ssc_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex2
    table: site_yyy
    key_columns:
      - {column: email, key: email}
      - {column: td_ssc_id, key: td_ssc_id}

  - database: test_id_unification_ex2
    table: site_zzz
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: email, key: email}

canonical_ids:
  - name: person_id
    merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]    
    merge_iterations: 5

master_tables:
  - name: master_table_ex2
    canonical_id: person_id
    attributes:
      - name: email
        source_columns:
          - {table: site_yyy, order: last, order_by: time, priority: 1}
          - {table: site_zzz, order: last, order_by: time, priority: 1}                    
      - name: td_ssc_id
        array_elements: 5
        source_columns:
          - {table: site_xxx, order: last, order_by: time, priority: 1}
          - {table: site_yyy, order: last, order_by: time, priority: 2}
      - name: td_client_id
        array_elements: 5
        source_columns:
          - {table: site_aaa, order: last, order_by: time, priority: 1}
          - {table: site_zzz, order: last, order_by: time, priority: 4}                    
      - name: td_global_id
        array_elements: 5
        source_columns:
          - {table: site_aaa, order: last, order_by: time, priority: 1}
          - {table: site_xxx, order: last, order_by: time, priority: 2}

Explanation of the Unification Algorithm

graph_unify_loop_0

The initial state of the graph is as follows:

canonical_id_1

How the graph_unify_loop_0 Table Is Created

The graph_unify_loop_0 table is created within the +extract_and_merge task of the workflow. Let's examine how the graph is constructed.

Copy
Copied
canonical_ids:
  - name: person_id
    merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]    
    merge_iterations: 5

In this configuration of canonical_ids:, merge_by_keys: prioritizes email to ensure the creation of the most universal canonical_id, followed by td_ssc_id.

Using the original data, the graph_unify_loop_0 table is generated through the same steps as in the previous section.

canonical_id_1

Unlike before, pairs for site_aaa containing three or more keys in a single table are created. In this case, for a single record, the leader and followers are as follows:

  • Leader:
    • td_ssc_id
  • Followers:
    • td_client_id
    • td_global_id
    • td_ssc_id

As a result, the record expands into three pairs for each follower (highlighted in green in the figure above).

follower_ns and leader_ns identify which key corresponds to each id. (In this example, 1 is td_client_id, 2 is td_global_id, 3 is td_ssc_id, and 4 is email.) In this configuration, the priority order of ns is [4,3,1,2].

graph_unify_loop_1

Based on the graph from loop_0, leaders are updated according to the following rules:

canonical_id_1

The key difference this time is:

  • If a leader is connected to a higher-priority leader, it will be replaced before leaders of the same priority adjacent to the follower.

All leaders are replaced with new_leader (unless they are the minimum or maximum priority, in which case they remain unchanged). The table resulting from merging all these replacements is the graph_unify_loop_1 table.

canonical_id_1

canonical_id_1

Subsequent loops follow the same process with the graph from the previous iteration.

graph_unify_loop_2

canonical_id_1

graph_unify_loop_3

The process converges in the third loop.

canonical_id_1

Generating and Assigning canonical_id

In this example, the order of keys specified in merge_by_keys: prioritizes email as the highest. This is because, unlike other keys, the value of email is less likely to change over time. For users without an email, the td_ssc_id becomes the final leader, and the canonical_id is generated based on this value.

canonical_id_1

canonical_id Configurations for master_table

The canonical_id used in the master_table is generally created with all keys specified in merge_by_keys:. However, canonical_id can also be created using a subset of the keys.

That said, the canonical_id configured in the master_table must meet the following condition:

  • The configuration must ensure that a canonical_id is assigned to every record in all source tables. In other words, every source table must have at least one key included in merge_by_keys: .

For this example (Example 2 data), the keys in each table are as follows:

  • site_aaa
    • td_client_id
    • td_global_id
    • td_ssc_id
  • site_xxx
    • td_ssc_id
    • td_global_id
  • site_yyy
    • email
    • td_ssc_id
  • site_zzz
    • td_client_id
    • email

Given this, the merge_by_keys: configurations valid for the master_table are:

  • Good: [email, td_ssc_id, td_global_id, td_client_id]
  • Good: [td_ssc_id, td_global_id, td_client_id]
  • Good: [email, td_global_id]
  • Good: [td_ssc_id, td_client_id]

On the other hand, the following configurations will result in errors:

  • Bad: [td_global_id, td_client_id]
    • Missing keys from the site_yyy table.
  • Bad: [email, td_client_id]
    • Missing keys from the site_xxx table.

In such cases, an error like the following will appear:

Copy
Copied
400 Bad Request: {"master_tables[0].attributes[0].source_columns[0].table":["is invalid because it doesn't have columns merged into this master_table's canonical_id 'person_id'"],"master_tables[0].attributes[1].sou...

For example, in the [td_global_id, td_client_id] case, canonical_id is only assigned to records generated using td_global_id and td_client_id. Since the site_yyy table does not include these keys, it cannot be assigned a canonical_id. At least one key from merge_by_keys: (used for stitching) must be present to ensure that every record in all tables is assigned a canonical_id.