Files Required for ID Unification Workflow
To set up a Unification Workflow, the following two files are essential:
- Dig file : Used to invoke the Unification Workflow.
- YAML file : Defines the data sources and stitching keys used for ID unification.
This document provides detailed instructions on how to write these files. It includes all available options, so refer to this guide if you're uncertain about specific configurations.
Writing a dig File
The following example serves as a template for creating a dig
file. The file primarily sets options for running the Unification Workflow (WF). You can use this as a basic template for your implementation.
+call_unification:
http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call
headers:
- authorization: ${secret:td.apikey}
method: POST
retry: true
content_format: json
content:
run_canonical_ids: true
run_enrichments: true
run_master_tables: true
full_refresh: true
keep_debug_tables: true
unification:
!include : unification_ex1.yml
http_call Operator
This operator triggers the Unification Workflow. The endpoint URL depends on your account's region:
Region | URL |
---|---|
US | https://api-cdp.treasuredata.com/unifications/workflow_call |
EU01 | https://api-cdp.eu01.treasuredata.com/unifications/workflow_call |
Tokyo | https://api-cdp.treasuredata.co.jp/unifications/workflow_call |
Korea | https://api-cdp.ap02.treasuredata.com/unifications/workflow_call |
Headers
Authentication is performed using the td.apikey
secret, which contains the Master API Key for your account. No manual setup of the secret is required because it is auto-generated for ID Unification WF.
Warning
If you manually define a td.apikey
secret, ensure it is formatted differently than a standard API Key. Prepend TD1
(with a space) to the API Key value, e.g., TD1 24/939***
. Improper formatting may cause authentication failures.
Content Options
The content
section defines execution options for the ID Unification Workflow and includes a reference to a yml
file.
Option Name | Required? | Default Value | Description |
---|---|---|---|
keep_debug_tables |
Optional | false |
Set to false to discard intermediate debug tables. |
run_canonical_ids |
Optional | true |
Set to false to skip the canonical ID generation process. |
run_enrichments |
Optional | true |
Set to false to skip the enrichment process for canonical IDs. |
run_master_tables |
Optional | true |
Set to false to skip the master table generation process. |
full_refresh |
Optional | true |
If false , performs incremental updates instead of full refreshes. As described in the doc, incremental_update will run full refreshes every 3 days |
Recommendation
Keep run_*
options set to true
to ensure all necessary processes are executed. For keep_debug_tables
, set it to true
. This will help to vidualize the ID Unification statitics later. Configure full_refresh: false
only when incremental updates are required.
Writing a yml File
The example below demonstrates how to write the referenced yml
file (unification_ex1.yml
) from the dig
file example.
name: production
keys:
- name: td_client_id
valid_regexp: "[0-9a-fA-F]{8}-..."
invalid_texts: ['']
- name: td_global_id
valid_regexp: "[0-9a-fA-F]{8}-..."
invalid_texts: ['', '0000000000-...']
- name: email
valid_regexp: ".*@.*"
tables:
- database: prod
table: pageviews
incremental_columns: [updated_at, id]
key_columns:
- {column: td_client_id, key: td_client_id}
- database: brand2
table: pageviews
as: brand2_pageviews
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
- {column: email, key: email}
- database: prod
table: contacts
key_columns:
- {column: email, key: email}
canonical_ids:
- name: browser_id
merge_by_keys: [td_client_id, td_global_id]
master_tables:
- name: marketing_master
canonical_id: browser_id
attributes:
- name: browser_id
source_canonical_id: browser_id
- name: email
source_columns:
- {table: contacts, column: email}
Names
Specifies the name of this Unification. This name will be entered as the database name (cdp_unification_${name}
) that will store the output of the ID Unification. In the above example, the cdp_unification_production database
is created and the output is stored.
Keys
Define the keys used for stitching IDs across all tables.
Option Name | Required? | Description |
---|---|---|
name |
Required | Key name. Use consistent naming even if column names differ across tables. |
valid_regexp |
Optional | Regular expression to validate key values. |
invalid_texts |
Optional | Array of invalid key values to exclude. Ex. '' , N/A |
Tips: Regular Expression for td_client_id
/ td_global_id
The values for td_client_id
and td_global_id
are similar to:
-
307423ab-9cbc-4bca-9cae-05c3700cc8f2
-
fc6422e1-48bb-4396-b7aa-bdd2075c000d
These can be uniformly expressed as the following regular expression:
-
[0-9a-fA-F]{8}-[0-9a-fA-F]{4}[0-9a-fA-F]{4}[0-9a-fA-F]{4}[0-9a-fA-F]{12}
Tables
Configure the tables used in the unification process.
Option Name | Required? | Description |
---|---|---|
database |
Required | Specify the database name. |
table |
Required | Specify the table name. |
key_columns |
Required | Map table columns to defined keys in Array(Map). |
as |
Optional | Alias for the table to avoid name collisions during output. |
incremental_columns |
Optional | Specify columns for incremental updates (e.g., [updated_at, id] ) in Array(String). |
Tips: When the column and key differ in key_columns
In most cases, the values of column
and key
may match. However, there are instances where the identifier recorded under the column name cookie_id
in one table needs to be stitched together with the td_client_id
column from another table (registered as td_client_id
in keys
).
For such cases, it would be written as:
-
{column: cookie_id, key: td_client_id}
Tips: Cases where as:
should be used
This option is necessary in situations like the current example, where there are tables with the same name, such as pageviews
, in different databases, and both are used as source tables. When creating an enriched table in the destination database, the table names would conflict. To avoid this collision, the as:
option is used.
canonical_ids
Defines how IDs are unified and assigns canonical identifiers.
Example:
canonical_ids:
- name: browser_id
merge_by_keys: [td_client_id, td_global_id, td_ssc_id]
- name: marketing_id
merge_by_canonical_ids: [browser_id]
merge_by_keys: [email]
source_tables: [pageviews, contacts]
- name: contact_id
merge_by_canonical_ids: [browser_id]
merge_by_keys: [membership_id, email]
merge_iterations: 3
incremental_merge_iterations: 2
Options:
Option Name | Required? | Type | Description |
---|---|---|---|
name |
Required | String | Name of the canonical ID. |
merge_by_keys |
Required | Array(String) | Keys used for merging. The order specifies priority. |
merge_iterations |
Optional | Integer | Number of iterations for the merging algorithm. Defaults to 3 . |
incremental_merge_iterations |
Optional | Integer | Number of iterations for incremental updates. Defaults to 2 . |
merge_by_canonical_ids |
Optional | Array(String) | Used to create a canonical ID by merging other canonical IDs. |
master_tables
Specifies the master output tables and attributes for unified profiles.
Example:
master_tables:
- name: marketing_master
canonical_id: browser_id
attributes:
- name: email
source_columns:
- {table: contacts, column: email}
Options for canonical_ids
Option Name | Required? | Value | Description |
---|---|---|---|
name: |
Required | String | Sets the name of the canonical_id . |
merge_by_keys: |
Required | String | Specifies which set of keys will be used to perform stitching (Unification Algorithm). The variation of keys and the order in which they are set (priority) are extremely important. |
merge_iterations: |
Optional | Integer | Sets the number of loops for the Unification Algorithm, which is processed iteratively. The appropriate number of iterations depends on the problem (data), but insufficient looping can be identified from the execution results. Even if a higher number is set, the process will stop once the algorithm does. Start with 5 or 10 iterations for testing. The default value is 3 . |
incremental_merge_iterations: |
Optional | Integer | Sets the number of iterations used during incremental_update. The default value is 2 . |
source_tables: |
Optional | Array(String) | Lists specific tables from those defined in tables: to use only the keys from these tables for unification. If not specified, all tables are referenced. |
merge_by_canonical_ids: |
Optional | Array(String) | Used to create a canonical_id by merging other canonical_ids as described here. |
(Reference) Naming a canonical_id
When naming a canonical_id
, it is important to make the name clear enough to indicate what the unified ID represents. Additionally, keep the name concise, as it will appear in the output table name. For example:
-
browser_id, unified_cookie_id
-
Used when integrating identifiers such as
td_client_id
,td_global_id
,td_ssc_id
, andcookie_id
.
-
Used when integrating identifiers such as
-
person_id, customer_id, cust_id, member_id
-
Used when integrating keys containing personal information such as
email
ormember_id
.
-
Used when integrating keys containing personal information such as
Options for master_tables
master_tables:
- name: marketing_master
canonical_id: marketing_id
attributes:
- name: browser_id
source_canonical_id: browser_id
- name: firstname
source_columns:
- {table: pageviews}
- {table: form_submits, column: first_name}
- name: birthdate
valid_regexp: "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}"
invalid_texts: ['']
source_columns:
- {table: contacts, priority: 1}
- {table: pageviews, priority: 2}
- {table: form_submits, priority: 3}
- name: all_emails
array_elements: 5
source_columns:
- {table: contacts, order: last, order_by: time, priority: 1}
- {table: pageviews, order: first, order_by: time, priority: 2}
- {table: form_submits, order: first, order_by: time, priority: 2}
Option Name | Required? | Value | Description |
---|---|---|---|
name: |
Required | String | Specifies the name of the master_table that will hold the canonical_id defined by canonical_id: . |
canonical_id: |
Required | String | Specifies the base canonical_id for the master_table . The number of rows in the master_table corresponds to the unique count of the canonical_id . Note that if the merge_by_keys: configuration for the canonical_id does not cover the key combinations across all source tables, an error will occur (since enrichment across all source tables will not be possible). |
attributes: |
Optional | (Deep Nests) | Additional columns from the source tables, other than the canonical_id , can be specified as attributes: to include them in the master_table . This option is explained in detail below. |
Options for attributes:
Option Name | Required? | Value | Description |
---|---|---|---|
name: |
Required | String | Specifies the attribute name, which will become the column name in the master_table . |
array_elements: |
Optional | Array(String) | If the specified column contains multiple values for a single canonical_id (e.g., td_client_id used as keys), this option allows the column to store multiple values as an array. Specifies the number of values to include (up to 10 ). If omitted, the column will contain only a single value. |
source_columns: |
Required | (Deep Nests) | Since the specified attribute may span multiple tables and records, this option defines which table and record values to prioritize. If the column name in the table differs from the name: specified, use the column: option here to specify the column name. |
valid_regexp: |
Optional | Regular Expression | Same as the option with the same name in keys: . |
invalid_texts: |
Optional | Array(String) | Same as the option with the same name in keys: . |
Options for source_columns:
Option Name | Required? | Value | Description |
---|---|---|---|
table: |
Required | String | Name of the table to use as the attribute (database name is not required). |
column: |
Optional | String | If the column name in the table differs from the name: specified, specify the column name here. |
priority: |
Optional | Integer | Specifies the priority for selecting values when multiple tables are provided. Tables can share the same priority level. Within the same priority level, the order: option determines the order of records in the table. |
order: |
Optional | String | Specifies the sorting order for records based on the column defined in order_by: (default: "time"). Use :first for ascending (ASC) and :last for descending (DESC). Default is :last . |
order_by: |
Optional | String | Used with the order: option to specify the column for sorting. Default is "time". |
Notes on Setting the Same priority:
If multiple tables are assigned the same priority:
, they must share the same sorting order. If different sorting orders are specified, an error will occur. For tables with the same priority level, the specified sorting is uniformly applied, and values are selected sequentially from top to bottom.