Multiple Canonical IDs
Up until the previous page, only one canonical_id
was created by setting all the keys specified in merge_by_keys:
. In this page, we introduce how to configure multiple canonical_id
s.
We will use the data from Learning canonical_id for Cookie ID.
Case1: Generating Multiple Canonical IDs Independently
By specifying multiple canonical_id
s in canonical_ids:
, it is possible to generate multiple independent canonical_id
s simultaneously. However, only one canonical_id
can be set as the Parent Segment, which is generally the one configured with all keys in merge_by_keys:
.
name: test_id_unification_ex3
keys:
- name: td_client_id
invalid_texts: ['']
- name: td_global_id
valid_regexp: "3rd_*"
invalid_texts: ['']
tables:
- database: test_id_unification_ex3
table: site_aaa
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
- database: test_id_unification_ex3
table: site_xxx
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
- database: test_id_unification_ex3
table: site_yyy
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
- database: test_id_unification_ex3
table: site_zzz
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
canonical_ids:
- name: brand_id_ax
merge_by_keys: [td_client_id, td_global_id]
source_tables: ['site_aaa', 'site_xxx']
merge_iterations: 5
- name: brand_id_yz
merge_by_keys: [td_client_id, td_global_id]
source_tables: ['site_yyy', 'site_zzz']
merge_iterations: 5
- name: unified_brand_id
merge_by_keys: [td_client_id, td_global_id]
merge_iterations: 5
master_tables:
- name: master_table_ex3
canonical_id: unified_brand_id
attributes:
- name: td_client_id
invalid_texts: ['']
source_columns:
- {table: 'site_aaa', order: first, order_by: td_client_id, priority: 1}
- {table: 'site_xxx', order: first, order_by: td_client_id, priority: 1}
- {table: 'site_yyy', order: first, order_by: td_client_id, priority: 1}
- {table: 'site_zzz', order: first, order_by: td_client_id, priority: 1}
- name: td_global_id
valid_regexp: "3rd_*"
invalid_texts: ['']
source_columns:
- {table: 'site_aaa', order: last, order_by: time, priority: 1}
- {table: 'site_xxx', order: last, order_by: time, priority: 1}
- {table: 'site_yyy', order: last, order_by: time, priority: 1}
- {table: 'site_zzz', order: last, order_by: time, priority: 1}
brand_id_ax
is created by performing unification on data from only two sites of the same brand, with the goal of assigning a canonical_id
specific to this brand. Similarly, brand_id_yz
is configured for another brand. Additionally, a unified_brand_id
spanning all brands is created and set for the master_table
.
In this configuration, the following three canonical_id
s will execute their unification algorithms in parallel:
-
brand_id_ax
-
brand_id_yz
-
unified_brand_id
Below are the final graphs for each unification algorithm.
brand_id_ax_graph
brand_id_yz_graph
unified_brand_id_graph
About the Configuration of source_tables:
In this example, source_tables:
is specified as follows in the canonical_ids:
configuration:
canonical_ids:
- name: brand_id_ax
merge_by_keys: [td_client_id, td_global_id]
source_tables: ['site_aaa', 'site_xxx']
merge_iterations: 5
- name: brand_id_yz
merge_by_keys: [td_client_id, td_global_id]
source_tables: ['site_yyy', 'site_zzz']
merge_iterations: 5
- name: unified_brand_id
merge_by_keys: [td_client_id, td_global_id]
merge_iterations: 5
In this setup, the unification algorithm uses only the keys from the tables specified in source_tables:
. This configuration can be useful when you want to generate canonical_id
s for specific brands or games in addition to canonical_id
s spanning all data sources.
Case2. Creating Master Tables for Each canonical_id
Next, we will introduce how to create a master_table
for each canonical_id
. Multiple master_table
entries can be enumerated as shown below:
master_tables:
- name: master_table_ex3
canonical_id: unified_brand_id
attributes:
- name: td_client_id
invalid_texts: ['']
source_columns:
- {table: '${td.tbl_aaa}', order: first, order_by: td_client_id, priority: 1}
- {table: '${td.tbl_xxx}', order: first, order_by: td_client_id, priority: 1}
- {table: '${td.tbl_yyy}', order: first, order_by: td_client_id, priority: 1}
- {table: '${td.tbl_zzz}', order: first, order_by: td_client_id, priority: 1}
- name: td_global_id
valid_regexp: "3rd_*"
invalid_texts: ['']
source_columns:
- {table: '${td.tbl_aaa}', order: last, order_by: time, priority: 1}
- {table: '${td.tbl_xxx}', order: last, order_by: time, priority: 1}
- {table: '${td.tbl_yyy}', order: last, order_by: time, priority: 1}
- {table: '${td.tbl_zzz}', order: last, order_by: time, priority: 1}
- name: master_table_brand_id_ax
canonical_id: brand_id_ax
attributes:
- name: td_client_id
invalid_texts: ['']
source_columns:
- {table: '${td.tbl_aaa}', order: first, order_by: td_client_id, priority: 1}
- {table: '${td.tbl_xxx}', order: first, order_by: td_client_id, priority: 1}
- name: td_global_id
valid_regexp: "3rd_*"
invalid_texts: ['']
source_columns:
- {table: '${td.tbl_aaa}', order: last, order_by: time, priority: 1}
- {table: '${td.tbl_xxx}', order: last, order_by: time, priority: 1}
- name: master_table_brand_id_yz
canonical_id: brand_id_yz
attributes:
- name: td_client_id
invalid_texts: ['']
source_columns:
- {table: '${td.tbl_yyy}', order: first, order_by: td_client_id, priority: 1}
- {table: '${td.tbl_zzz}', order: first, order_by: td_client_id, priority: 1}
- name: td_global_id
valid_regexp: "3rd_*"
invalid_texts: ['']
source_columns:
- {table: '${td.tbl_yyy}', order: last, order_by: time, priority: 1}
- {table: '${td.tbl_zzz}', order: last, order_by: time, priority: 1}
master_table
The resulting master_table
for each case would look like this:
master_table_ex3
unified_brand_id | td_client_id | td_global_id |
---|---|---|
jiioKQC2uMFH | aaa_001 | 3rd_018 |
master_table_brand_id_ax
brand_id_ax | td_client_id | td_global_id |
---|---|---|
jiioKQC2uMFH | aaa_001 | 3rd_005 |
0_LjXD4ScR1H | aaa_002 | 3rd_010 |
A0T5jlq77XJH | aaa_003 | 3rd_017 |
master_table_brand_id_yz
brand_id_yz | td_client_id | td_global_id |
---|---|---|
NCnsgr0MusJH | yyy_001 | 3rd_001 |
UA9xZODmd_pH | yyy_002 | 3rd_008 |
lb8seLF7YZBH | yyy_003 | 3rd_012 |
jaUfFmYfVjJH | yyy_004 | 3rd_018 |
nBTjvWWxqydH | zzz_001 | 3rd_002 |
enriched_ Table
Even if multiple master_table
s are created, only one enriched_table
is generated per source table. The key difference from Case1 is that all canonical_id
s are enriched into the source tables, allowing linkage with any master_table
. However, for brand_id_ax
and brand_id_yz
specified under source_tables
, enrichment is limited to the designated tables. The enriched tables are shown below:
enriched_site_aaa
site | td_client_id | td_global_id | brand_id_ax | unified_brand_id |
---|---|---|---|---|
aaa.jp | aaa_001 | 3rd_001 | jiioKQC2uMFH | jiioKQC2uMFH |
aaa.jp | aaa_001 | 3rd_002 | jiioKQC2uMFH | jiioKQC2uMFH |
aaa.jp | aaa_001 | 3rd_003 | jiioKQC2uMFH | jiioKQC2uMFH |
aaa.jp | aaa_001 | 3rd_004 | jiioKQC2uMFH | jiioKQC2uMFH |
Here, the same value appears for both brand_id_ax
and unified_brand_id
, as both were based on aaa_001
, which happened to be the final leader, generating the respective canonical_id
. However, the two canonical_id
s are not inherently related.
enriched_site_xxx
site | td_client_id | td_global_id | brand_id_ax | unified_brand_id | time |
---|---|---|---|---|---|
xxx.jp | 1672876800 | ||||
xxx.jp | 1673740800 | ||||
xxx.jp | 1674604800 | ||||
xxx.jp | xxx_001 | 3rd_004 | jiioKQC2uMFH | jiioKQC2uMFH | 1675555200 |
xxx.jp | xxx_001 | 3rd_005 | jiioKQC2uMFH | jiioKQC2uMFH | 1676419200 |
enriched_site_yyy
site | td_client_id | td_global_id | brand_id_yz | unified_brand_id | time |
---|---|---|---|---|---|
yyy.jp | yyy_001 | 3rd_001 | NCnsgr0MusJH | jiioKQC2uMFH | 1672876800 |
yyy.jp | 1673740800 | ||||
yyy.jp | 1674604800 | ||||
yyy.jp | 1675555200 | ||||
yyy.jp | yyy_002 | 3rd_005 | UA9xZODmd_pH | jiioKQC2uMFH | 1676419200 |
enriched_site_zzz
site | td_client_id | td_global_id | brand_id_yz | unified_brand_id | time |
---|---|---|---|---|---|
zzz.jp | 1672876800 | ||||
zzz.jp | zzz_001 | 3rd_002 | nBTjvWWxqydH | jiioKQC2uMFH | 1673740800 |
zzz.jp | 1674604800 | ||||
zzz.jp | 1675555200 | ||||
zzz.jp | 1676419200 |
Case3. Generating a Unified canonical_id
by Merging Multiple canonical_id
s
By using merge_by_canonical_ids:
in the configuration for each canonical_id
within canonical_ids:
, you can generate a unified canonical_id
by merging other canonical_id
s.
canonical_ids:
- name: brand_id_ax
merge_by_keys: [td_client_id, td_global_id]
source_tables: ['site_aaa', 'site_xxx']
merge_iterations: 5
- name: brand_id_yz
merge_by_keys: [td_client_id, td_global_id]
source_tables: ['site_yyy', 'site_zzz']
merge_iterations: 5
- name: unified_brand_id
merge_by_keys: []
merge_by_canonical_ids: [brand_id_ax, brand_id_yz]
merge_iterations: 5
In this configuration, unified_brand_id
is generated by merging brand_id_ax
and brand_id_yz
.
Since unified_brand_id
is set as the canonical_id
for the master_table
, it must cover all source tables. Therefore, it is crucial to ensure that the combined source_tables:
of brand_id_ax
and brand_id_yz
cover all necessary tables.
Although the merge_by_keys:
for unified_brand_id
is empty, the configuration works because merge_by_canonical_ids
is specified.
Under this setup, the Unification Algorithm will first process brand_id_ax
and brand_id_yz
in parallel. Once these are complete, the Unification Algorithm for unified_brand_id
will execute.
Unification Algorithm with merge_by_canonical_ids
When merge_by_canonical_ids:
is specified, the initial graph of the Unification Algorithm takes on a completely different structure.
In addition to the keys set in merge_by_keys:
, the keys from the merge_by_keys:
of the canonical IDs specified in merge_by_canonical_ids:
, as well as the canonical IDs themselves, are added as keys.
The configuration for unified_brand_id
will be used to explain this process.
- name: unified_brand_id
merge_by_keys: []
merge_by_canonical_ids: [brand_id_ax, brand_id_yz]
merge_iterations: 5
unified_brand_id_graph_unify_loop_0
Under this configuration, graph_unify_loop_0
will look as follows:
Despite having an empty merge_by_keys:
, all keys from all data sources and the canonical IDs themselves are included.
Creation Rules for graph_unify_loop_0
The initial graph starts by merging the following graphs, meaning that the process begins with a highly interconnected state:
-
The final graph of
brand_id_ax
-
The final graph of
brand_id_yz
-
The initial graph generated from the combined source tables of
brand_id_ax
andbrand_id_yz
Rules for Determining merge_by_keys:
The applicable merge_by_keys:
configuration is determined by the following rules:
-
The canonical IDs specified in
merge_by_canonical_ids:
are added as stitching keys. -
Then,
merge_by_keys:
is determined in the following order of precedence:-
Keys in the
merge_by_keys:
ofunified_brand_id
have the highest priority. -
Canonical IDs from
brand_id_ax
andbrand_id_yz
, as specified inmerge_by_canonical_ids:
, have the next highest priority in the order listed. -
The
merge_by_keys:
settings ofbrand_id_ax
andbrand_id_yz
(excluding keys already used byunified_brand_id
) have the lowest priority.
-
Keys in the
In this example, the final merge_by_keys:
is as follows:
merge_by_keys: [canonical_id (brand_id_ax), canonical_id (brand_id_yz), td_client_id, td_global_id]
The Unification Algorithm is applied to graph_unify_loop_0
using this priority order.
unified_brand_id_graph
key | canonical_id |
---|---|
04zahkB6hA9G | wONJPFlQsJyj |
In this configuration, the canonical_id
is generated based on canonical_id (brand_id_ax)
. However, generating a canonical_id
based on another canonical_id
means that any changes in the base keys of the original canonical_id
will also affect this one, adding complexity. To avoid this, consider specifying keys in the merge_by_keys:
of unified_brand_id
:
- name: unified_brand_id
merge_by_keys: [td_global_id]
merge_by_canonical_ids: [brand_id_ax, brand_id_yz]
merge_iterations: 5
In this case, the true priority of merge_by_keys:
becomes:
merge_by_keys: [td_global_id, canonical_id (brand_id_ax), canonical_id (brand_id_yz), td_client_id]
The initial graph remains the same, but the final graph will use td_global_id
as the leader.
Case4. Setting canonical_id
as an Attribute in the master_table
A canonical_id
can be set as an attribute in the Parent Segment.
canonical_ids:
- name: brand_id_ax
merge_by_keys: [td_client_id, td_global_id]
source_tables: ['site_aaa','site_xxx']
merge_iterations: 5
- name: brand_id_yz
merge_by_keys: [td_client_id, td_global_id]
source_tables: ['site_yyy','site_zzz']
merge_iterations: 5
- name: unified_brand_id
merge_by_keys: []
merge_by_canonical_ids: [brand_id_ax, brand_id_yz]
merge_iterations: 5
master_tables:
- name: master_table_ex3
canonical_id: unified_brand_id
attributes:
- name: brand_id_ax
source_canonical_id: brand_id_ax
- name: brand_id_yz
source_canonical_id: brand_id_yz
- name: td_client_id
invalid_texts: ['']
source_columns:
- {table: 'site_aaa', order: first, order_by: td_client_id, priority: 1}
- {table: 'site_xxx', order: first, order_by: td_client_id, priority: 1}
- {table: 'site_yyy', order: first, order_by: td_client_id, priority: 1}
- {table: 'site_zzz', order: first, order_by: td_client_id, priority: 1}
- name: td_global_id
valid_regexp: "3rd_*"
invalid_texts: ['']
source_columns:
- {table: 'site_aaa', order: last, order_by: time, priority: 1}
- {table: 'site_xxx', order: last, order_by: time, priority: 1}
- {table: 'site_yyy', order: last, order_by: time, priority: 1}
- {table: 'site_zzz', order: last, order_by: time, priority: 1}
In this configuration, the canonical IDs brand_id_ax
and brand_id_yz
are set within the attributes:
section. Other canonical IDs can also be configured as attributes, but they must always be related to the canonical_id
of the master_table
(in this case, unified_brand_id
) through merge_by_canonical_ids:
.
It is also worth reiterating that brand_id_ax
and brand_id_yz
cannot be set as the canonical ID of the master_table
. This is because they do not cover all source tables, which prevents the enriched canonical ID from being applied to all tables.
master_table
The output master_table
generates records based only on the number of unified_brand_id
specified in canonical_id:
. It is important to understand that attributes do not influence this record generation.
As seen in Case1, brand_id_ax
and brand_id_yz
each contained multiple canonical IDs. However, when referenced as attributes, only the canonical ID corresponding to the most recent key (in terms of the time
column) is included.
unified_brand_id | td_client_id | td_global_id | brand_id_ax | brand_id_yz |
---|---|---|---|---|
wONJPFlQsJyj | aaa_001 | 3rd_018 | XuCLIRp30bxG | RXcPcKu6CO1G |