Multiple Canonical IDs

Up until the previous page, only one canonical_id was created by setting all the keys specified in merge_by_keys:. In this page, we introduce how to configure multiple canonical_ids.

We will use the data from Learning canonical_id for Cookie ID.

Case1: Generating Multiple Canonical IDs Independently

By specifying multiple canonical_ids in canonical_ids:, it is possible to generate multiple independent canonical_ids simultaneously. However, only one canonical_id can be set as the Parent Segment, which is generally the one configured with all keys in merge_by_keys:.

Copy
Copied
name: test_id_unification_ex3

keys:
  - name: td_client_id
    invalid_texts: ['']

  - name: td_global_id
    valid_regexp: "3rd_*"
    invalid_texts: ['']

tables:
  - database: test_id_unification_ex3
    table: site_aaa
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex3
    table: site_xxx
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex3
    table: site_yyy
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex3
    table: site_zzz
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

canonical_ids:
  - name: brand_id_ax
    merge_by_keys: [td_client_id, td_global_id]
    source_tables: ['site_aaa', 'site_xxx']
    merge_iterations: 5

  - name: brand_id_yz
    merge_by_keys: [td_client_id, td_global_id]
    source_tables: ['site_yyy', 'site_zzz']
    merge_iterations: 5

  - name: unified_brand_id
    merge_by_keys: [td_client_id, td_global_id]
    merge_iterations: 5

master_tables:
  - name: master_table_ex3
    canonical_id: unified_brand_id

    attributes:    
      - name: td_client_id
        invalid_texts: ['']
        source_columns:
          - {table: 'site_aaa', order: first, order_by: td_client_id, priority: 1}
          - {table: 'site_xxx', order: first, order_by: td_client_id, priority: 1}
          - {table: 'site_yyy', order: first, order_by: td_client_id, priority: 1}
          - {table: 'site_zzz', order: first, order_by: td_client_id, priority: 1}                    
      - name: td_global_id
        valid_regexp: "3rd_*"
        invalid_texts: ['']
        source_columns:
          - {table: 'site_aaa', order: last, order_by: time, priority: 1}
          - {table: 'site_xxx', order: last, order_by: time, priority: 1}
          - {table: 'site_yyy', order: last, order_by: time, priority: 1}
          - {table: 'site_zzz', order: last, order_by: time, priority: 1}                    

brand_id_ax is created by performing unification on data from only two sites of the same brand, with the goal of assigning a canonical_id specific to this brand. Similarly, brand_id_yz is configured for another brand. Additionally, a unified_brand_id spanning all brands is created and set for the master_table.

In this configuration, the following three canonical_ids will execute their unification algorithms in parallel:

  • brand_id_ax
  • brand_id_yz
  • unified_brand_id

Below are the final graphs for each unification algorithm.

brand_id_ax_graph

canonical_id_1

brand_id_yz_graph

canonical_id_1

unified_brand_id_graph

canonical_id_1

About the Configuration of source_tables:

In this example, source_tables: is specified as follows in the canonical_ids: configuration:

Copy
Copied
canonical_ids:
  - name: brand_id_ax
    merge_by_keys: [td_client_id, td_global_id]
    source_tables: ['site_aaa', 'site_xxx']
    merge_iterations: 5

  - name: brand_id_yz
    merge_by_keys: [td_client_id, td_global_id]
    source_tables: ['site_yyy', 'site_zzz']
    merge_iterations: 5

  - name: unified_brand_id
    merge_by_keys: [td_client_id, td_global_id]
    merge_iterations: 5

In this setup, the unification algorithm uses only the keys from the tables specified in source_tables:. This configuration can be useful when you want to generate canonical_ids for specific brands or games in addition to canonical_ids spanning all data sources.

Case2. Creating Master Tables for Each canonical_id

Next, we will introduce how to create a master_table for each canonical_id. Multiple master_table entries can be enumerated as shown below:

Copy
Copied
master_tables:
  - name: master_table_ex3
    canonical_id: unified_brand_id

    attributes:
      - name: td_client_id
        invalid_texts: ['']
        source_columns:
          - {table: '${td.tbl_aaa}', order: first, order_by: td_client_id, priority: 1}
          - {table: '${td.tbl_xxx}', order: first, order_by: td_client_id, priority: 1}
          - {table: '${td.tbl_yyy}', order: first, order_by: td_client_id, priority: 1}
          - {table: '${td.tbl_zzz}', order: first, order_by: td_client_id, priority: 1}                    
      - name: td_global_id
        valid_regexp: "3rd_*"
        invalid_texts: ['']
        source_columns:
          - {table: '${td.tbl_aaa}', order: last, order_by: time, priority: 1}
          - {table: '${td.tbl_xxx}', order: last, order_by: time, priority: 1}
          - {table: '${td.tbl_yyy}', order: last, order_by: time, priority: 1}
          - {table: '${td.tbl_zzz}', order: last, order_by: time, priority: 1}                    

  - name: master_table_brand_id_ax
    canonical_id: brand_id_ax

    attributes:    
      - name: td_client_id
        invalid_texts: ['']
        source_columns:
          - {table: '${td.tbl_aaa}', order: first, order_by: td_client_id, priority: 1}
          - {table: '${td.tbl_xxx}', order: first, order_by: td_client_id, priority: 1}
      - name: td_global_id
        valid_regexp: "3rd_*"
        invalid_texts: ['']
        source_columns:
          - {table: '${td.tbl_aaa}', order: last, order_by: time, priority: 1}
          - {table: '${td.tbl_xxx}', order: last, order_by: time, priority: 1}

  - name: master_table_brand_id_yz
    canonical_id: brand_id_yz

    attributes:    
      - name: td_client_id
        invalid_texts: ['']
        source_columns:
          - {table: '${td.tbl_yyy}', order: first, order_by: td_client_id, priority: 1}
          - {table: '${td.tbl_zzz}', order: first, order_by: td_client_id, priority: 1}
      - name: td_global_id
        valid_regexp: "3rd_*"
        invalid_texts: ['']
        source_columns:
          - {table: '${td.tbl_yyy}', order: last, order_by: time, priority: 1}
          - {table: '${td.tbl_zzz}', order: last, order_by: time, priority: 1}

master_table

The resulting master_table for each case would look like this:

master_table_ex3

unified_brand_id td_client_id td_global_id
jiioKQC2uMFH aaa_001 3rd_018

master_table_brand_id_ax

brand_id_ax td_client_id td_global_id
jiioKQC2uMFH aaa_001 3rd_005
0_LjXD4ScR1H aaa_002 3rd_010
A0T5jlq77XJH aaa_003 3rd_017

master_table_brand_id_yz

brand_id_yz td_client_id td_global_id
NCnsgr0MusJH yyy_001 3rd_001
UA9xZODmd_pH yyy_002 3rd_008
lb8seLF7YZBH yyy_003 3rd_012
jaUfFmYfVjJH yyy_004 3rd_018
nBTjvWWxqydH zzz_001 3rd_002

enriched_ Table

Even if multiple master_tables are created, only one enriched_table is generated per source table. The key difference from Case1 is that all canonical_ids are enriched into the source tables, allowing linkage with any master_table. However, for brand_id_ax and brand_id_yz specified under source_tables, enrichment is limited to the designated tables. The enriched tables are shown below:

enriched_site_aaa

site td_client_id td_global_id brand_id_ax unified_brand_id
aaa.jp aaa_001 3rd_001 jiioKQC2uMFH jiioKQC2uMFH
aaa.jp aaa_001 3rd_002 jiioKQC2uMFH jiioKQC2uMFH
aaa.jp aaa_001 3rd_003 jiioKQC2uMFH jiioKQC2uMFH
aaa.jp aaa_001 3rd_004 jiioKQC2uMFH jiioKQC2uMFH

Here, the same value appears for both brand_id_ax and unified_brand_id, as both were based on aaa_001, which happened to be the final leader, generating the respective canonical_id. However, the two canonical_ids are not inherently related.

enriched_site_xxx

site td_client_id td_global_id brand_id_ax unified_brand_id time
xxx.jp 1672876800
xxx.jp 1673740800
xxx.jp 1674604800
xxx.jp xxx_001 3rd_004 jiioKQC2uMFH jiioKQC2uMFH 1675555200
xxx.jp xxx_001 3rd_005 jiioKQC2uMFH jiioKQC2uMFH 1676419200

enriched_site_yyy

site td_client_id td_global_id brand_id_yz unified_brand_id time
yyy.jp yyy_001 3rd_001 NCnsgr0MusJH jiioKQC2uMFH 1672876800
yyy.jp 1673740800
yyy.jp 1674604800
yyy.jp 1675555200
yyy.jp yyy_002 3rd_005 UA9xZODmd_pH jiioKQC2uMFH 1676419200

enriched_site_zzz

site td_client_id td_global_id brand_id_yz unified_brand_id time
zzz.jp 1672876800
zzz.jp zzz_001 3rd_002 nBTjvWWxqydH jiioKQC2uMFH 1673740800
zzz.jp 1674604800
zzz.jp 1675555200
zzz.jp 1676419200

Case3. Generating a Unified canonical_id by Merging Multiple canonical_ids

By using merge_by_canonical_ids: in the configuration for each canonical_id within canonical_ids:, you can generate a unified canonical_id by merging other canonical_ids.

Copy
Copied
canonical_ids:
  - name: brand_id_ax
    merge_by_keys: [td_client_id, td_global_id]
    source_tables: ['site_aaa', 'site_xxx']
    merge_iterations: 5

  - name: brand_id_yz
    merge_by_keys: [td_client_id, td_global_id]
    source_tables: ['site_yyy', 'site_zzz']
    merge_iterations: 5

  - name: unified_brand_id
    merge_by_keys: []
    merge_by_canonical_ids: [brand_id_ax, brand_id_yz]
    merge_iterations: 5

In this configuration, unified_brand_id is generated by merging brand_id_ax and brand_id_yz.

Since unified_brand_id is set as the canonical_id for the master_table, it must cover all source tables. Therefore, it is crucial to ensure that the combined source_tables: of brand_id_ax and brand_id_yz cover all necessary tables.

Although the merge_by_keys: for unified_brand_id is empty, the configuration works because merge_by_canonical_ids is specified.

Under this setup, the Unification Algorithm will first process brand_id_ax and brand_id_yz in parallel. Once these are complete, the Unification Algorithm for unified_brand_id will execute.

Unification Algorithm with merge_by_canonical_ids

When merge_by_canonical_ids: is specified, the initial graph of the Unification Algorithm takes on a completely different structure.

In addition to the keys set in merge_by_keys:, the keys from the merge_by_keys: of the canonical IDs specified in merge_by_canonical_ids:, as well as the canonical IDs themselves, are added as keys.

The configuration for unified_brand_id will be used to explain this process.

Copy
Copied
  - name: unified_brand_id
    merge_by_keys: []
    merge_by_canonical_ids: [brand_id_ax, brand_id_yz]
    merge_iterations: 5

unified_brand_id_graph_unify_loop_0

Under this configuration, graph_unify_loop_0 will look as follows:

Despite having an empty merge_by_keys:, all keys from all data sources and the canonical IDs themselves are included.

Creation Rules for graph_unify_loop_0

The initial graph starts by merging the following graphs, meaning that the process begins with a highly interconnected state:

  1. The final graph of brand_id_ax
  2. The final graph of brand_id_yz
  3. The initial graph generated from the combined source tables of brand_id_ax and brand_id_yz

Rules for Determining merge_by_keys:

The applicable merge_by_keys: configuration is determined by the following rules:

  • The canonical IDs specified in merge_by_canonical_ids: are added as stitching keys.
  • Then, merge_by_keys: is determined in the following order of precedence:
    1. Keys in the merge_by_keys: of unified_brand_id have the highest priority.
    2. Canonical IDs from brand_id_ax and brand_id_yz , as specified in merge_by_canonical_ids: , have the next highest priority in the order listed.
    3. The merge_by_keys: settings of brand_id_ax and brand_id_yz (excluding keys already used by unified_brand_id ) have the lowest priority.

In this example, the final merge_by_keys: is as follows:

Copy
Copied
merge_by_keys: [canonical_id (brand_id_ax), canonical_id (brand_id_yz), td_client_id, td_global_id]

The Unification Algorithm is applied to graph_unify_loop_0 using this priority order.

unified_brand_id_graph

canonical_id_1

key canonical_id
04zahkB6hA9G wONJPFlQsJyj

In this configuration, the canonical_id is generated based on canonical_id (brand_id_ax). However, generating a canonical_id based on another canonical_id means that any changes in the base keys of the original canonical_id will also affect this one, adding complexity. To avoid this, consider specifying keys in the merge_by_keys: of unified_brand_id:

Copy
Copied
  - name: unified_brand_id
    merge_by_keys: [td_global_id]
    merge_by_canonical_ids: [brand_id_ax, brand_id_yz]
    merge_iterations: 5

In this case, the true priority of merge_by_keys: becomes:

Copy
Copied
merge_by_keys: [td_global_id, canonical_id (brand_id_ax), canonical_id (brand_id_yz), td_client_id]

The initial graph remains the same, but the final graph will use td_global_id as the leader.

Case4. Setting canonical_id as an Attribute in the master_table

A canonical_id can be set as an attribute in the Parent Segment.

Copy
Copied
canonical_ids:
  - name: brand_id_ax
    merge_by_keys: [td_client_id, td_global_id]    
    source_tables: ['site_aaa','site_xxx']
    merge_iterations: 5

  - name: brand_id_yz
    merge_by_keys: [td_client_id, td_global_id]
    source_tables: ['site_yyy','site_zzz']
    merge_iterations: 5

  - name: unified_brand_id
    merge_by_keys: []
    merge_by_canonical_ids: [brand_id_ax, brand_id_yz]
    merge_iterations: 5

master_tables:
  - name: master_table_ex3
    canonical_id: unified_brand_id

    attributes:    
      - name: brand_id_ax
        source_canonical_id: brand_id_ax

      - name: brand_id_yz
        source_canonical_id: brand_id_yz

      - name: td_client_id
        invalid_texts: ['']
        source_columns:
          - {table: 'site_aaa', order: first, order_by: td_client_id, priority: 1}
          - {table: 'site_xxx', order: first, order_by: td_client_id, priority: 1}
          - {table: 'site_yyy', order: first, order_by: td_client_id, priority: 1}
          - {table: 'site_zzz', order: first, order_by: td_client_id, priority: 1}                    
      - name: td_global_id
        valid_regexp: "3rd_*"
        invalid_texts: ['']
        source_columns:
          - {table: 'site_aaa', order: last, order_by: time, priority: 1}
          - {table: 'site_xxx', order: last, order_by: time, priority: 1}
          - {table: 'site_yyy', order: last, order_by: time, priority: 1}
          - {table: 'site_zzz', order: last, order_by: time, priority: 1}                    

In this configuration, the canonical IDs brand_id_ax and brand_id_yz are set within the attributes: section. Other canonical IDs can also be configured as attributes, but they must always be related to the canonical_id of the master_table (in this case, unified_brand_id) through merge_by_canonical_ids:.

It is also worth reiterating that brand_id_ax and brand_id_yz cannot be set as the canonical ID of the master_table. This is because they do not cover all source tables, which prevents the enriched canonical ID from being applied to all tables.

master_table

The output master_table generates records based only on the number of unified_brand_id specified in canonical_id:. It is important to understand that attributes do not influence this record generation.

As seen in Case1, brand_id_ax and brand_id_yz each contained multiple canonical IDs. However, when referenced as attributes, only the canonical ID corresponding to the most recent key (in terms of the time column) is included.

unified_brand_id td_client_id td_global_id brand_id_ax brand_id_yz
wONJPFlQsJyj aaa_001 3rd_018 XuCLIRp30bxG RXcPcKu6CO1G