Files Required for ID Unification Workflow

To set up a Unification Workflow, the following two files are essential:

  1. Dig file : Used to invoke the Unification Workflow.
  2. YAML file : Defines the data sources and stitching keys used for ID unification.

This document provides detailed instructions on how to write these files. It includes all available options, so refer to this guide if you're uncertain about specific configurations.

Writing a dig File

The following example serves as a template for creating a dig file. The file primarily sets options for running the Unification Workflow (WF). You can use this as a basic template for your implementation.

Copy
Copied
+call_unification:
  http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call
  headers:
    - authorization: ${secret:td.apikey}
  method: POST
  retry: true
  content_format: json
  content:
    run_canonical_ids: true
    run_enrichments: true
    run_master_tables: true

    full_refresh: true
    keep_debug_tables: true

    unification:
      !include : unification_ex1.yml

http_call Operator

This operator triggers the Unification Workflow. The endpoint URL depends on your account's region:

Region URL
US https://api-cdp.treasuredata.com/unifications/workflow_call
EU01 https://api-cdp.eu01.treasuredata.com/unifications/workflow_call
Tokyo https://api-cdp.treasuredata.co.jp/unifications/workflow_call
Korea https://api-cdp.ap02.treasuredata.com/unifications/workflow_call

Headers

Authentication is performed using the td.apikey secret, which contains the Master API Key for your account. No manual setup of the secret is required because it is auto-generated for ID Unification WF.

Warning

If you manually define a td.apikey secret, ensure it is formatted differently than a standard API Key. Prepend TD1 (with a space) to the API Key value, e.g., TD1 24/939***. Improper formatting may cause authentication failures.

Content Options

The content section defines execution options for the ID Unification Workflow and includes a reference to a yml file.

Option Name Required? Default Value Description
keep_debug_tables Optional false Set to false to discard intermediate debug tables.
run_canonical_ids Optional true Set to false to skip the canonical ID generation process.
run_enrichments Optional true Set to false to skip the enrichment process for canonical IDs.
run_master_tables Optional true Set to false to skip the master table generation process.
full_refresh Optional true If false, performs incremental updates instead of full refreshes. As described in the doc, incremental_update will run full refreshes every 3 days
Recommendation

Keep run_* options set to true to ensure all necessary processes are executed. For keep_debug_tables, set it to true. This will help to vidualize the ID Unification statitics later. Configure full_refresh: false only when incremental updates are required.

Writing a yml File

The example below demonstrates how to write the referenced yml file (unification_ex1.yml) from the dig file example.

Copy
Copied
name: production
keys:
  - name: td_client_id
    valid_regexp: "[0-9a-fA-F]{8}-..."
    invalid_texts: ['']

  - name: td_global_id
    valid_regexp: "[0-9a-fA-F]{8}-..."
    invalid_texts: ['', '0000000000-...']

  - name: email
    valid_regexp: ".*@.*"

tables:
  - database: prod
    table: pageviews
    incremental_columns: [updated_at, id]
    key_columns:
      - {column: td_client_id, key: td_client_id}

 - database: brand2
    table: pageviews
    as: brand2_pageviews
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}
      - {column: email, key: email}

  - database: prod
    table: contacts
    key_columns:
      - {column: email, key: email}

canonical_ids:
  - name: browser_id
    merge_by_keys: [td_client_id, td_global_id]

master_tables:
  - name: marketing_master
    canonical_id: browser_id

    attributes:
      - name: browser_id
        source_canonical_id: browser_id

      - name: email
        source_columns:
          - {table: contacts, column: email}

Names

Specifies the name of this Unification. This name will be entered as the database name (cdp_unification_${name}) that will store the output of the ID Unification. In the above example, the cdp_unification_production database is created and the output is stored.

Keys

Define the keys used for stitching IDs across all tables.

Option Name Required? Description
name Required Key name. Use consistent naming even if column names differ across tables.
valid_regexp Optional Regular expression to validate key values.
invalid_texts Optional Array of invalid key values to exclude. Ex. '', N/A

Tips: Regular Expression for td_client_id / td_global_id

The values for td_client_id and td_global_id are similar to:

  • 307423ab-9cbc-4bca-9cae-05c3700cc8f2
  • fc6422e1-48bb-4396-b7aa-bdd2075c000d

These can be uniformly expressed as the following regular expression:

  • [0-9a-fA-F]{8}-[0-9a-fA-F]{4}[0-9a-fA-F]{4}[0-9a-fA-F]{4}[0-9a-fA-F]{12}

Tables

Configure the tables used in the unification process.

Option Name Required? Description
database Required Specify the database name.
table Required Specify the table name.
key_columns Required Map table columns to defined keys in Array(Map).
as Optional Alias for the table to avoid name collisions during output.
incremental_columns Optional Specify columns for incremental updates (e.g., [updated_at, id]) in Array(String). 

Tips: When the column and key differ in key_columns

In most cases, the values of column and key may match. However, there are instances where the identifier recorded under the column name cookie_id in one table needs to be stitched together with the td_client_id column from another table (registered as td_client_id in keys).

For such cases, it would be written as:

  • {column: cookie_id, key: td_client_id}

Tips: Cases where as: should be used

This option is necessary in situations like the current example, where there are tables with the same name, such as pageviews, in different databases, and both are used as source tables. When creating an enriched table in the destination database, the table names would conflict. To avoid this collision, the as: option is used.

canonical_ids

Defines how IDs are unified and assigns canonical identifiers.

Example:

Copy
Copied
canonical_ids:
  - name: browser_id
    merge_by_keys: [td_client_id, td_global_id, td_ssc_id]

  - name: marketing_id
    merge_by_canonical_ids: [browser_id]
    merge_by_keys: [email]
    source_tables: [pageviews, contacts]

  - name: contact_id
    merge_by_canonical_ids: [browser_id]
    merge_by_keys: [membership_id, email]
    merge_iterations: 3
    incremental_merge_iterations: 2

Options:

Option Name Required? Type Description
name Required String Name of the canonical ID.
merge_by_keys Required Array(String) Keys used for merging. The order specifies priority.
merge_iterations Optional Integer Number of iterations for the merging algorithm. Defaults to 3.
incremental_merge_iterations Optional Integer Number of iterations for incremental updates. Defaults to 2.
merge_by_canonical_ids Optional Array(String) Used to create a canonical ID by merging other canonical IDs.

master_tables

Specifies the master output tables and attributes for unified profiles.

Example:

Copy
Copied
master_tables:
  - name: marketing_master
    canonical_id: browser_id
    attributes:
      - name: email
        source_columns:
          - {table: contacts, column: email}

Options for canonical_ids

Option Name Required? Value Description
name: Required String Sets the name of the canonical_id.
merge_by_keys: Required String Specifies which set of keys will be used to perform stitching (Unification Algorithm). The variation of keys and the order in which they are set (priority) are extremely important.
merge_iterations: Optional Integer Sets the number of loops for the Unification Algorithm, which is processed iteratively. The appropriate number of iterations depends on the problem (data), but insufficient looping can be identified from the execution results. Even if a higher number is set, the process will stop once the algorithm does. Start with 5 or 10 iterations for testing. The default value is 3.
incremental_merge_iterations: Optional Integer Sets the number of iterations used during incremental_update. The default value is 2.
source_tables: Optional Array(String) Lists specific tables from those defined in tables: to use only the keys from these tables for unification. If not specified, all tables are referenced.
merge_by_canonical_ids: Optional Array(String) Used to create a canonical_id by merging other canonical_ids as described here.

(Reference) Naming a canonical_id

When naming a canonical_id, it is important to make the name clear enough to indicate what the unified ID represents. Additionally, keep the name concise, as it will appear in the output table name. For example:

  • browser_id, unified_cookie_id
    • Used when integrating identifiers such as td_client_id , td_global_id , td_ssc_id , and cookie_id .
  • person_id, customer_id, cust_id, member_id
    • Used when integrating keys containing personal information such as email or member_id .

Options for master_tables

Copy
Copied
master_tables:
  - name: marketing_master
    canonical_id: marketing_id
    attributes:
      - name: browser_id
        source_canonical_id: browser_id

      - name: firstname
        source_columns:
          - {table: pageviews}
          - {table: form_submits, column: first_name}

      - name: birthdate
        valid_regexp: "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}"
        invalid_texts: ['']
        source_columns:
          - {table: contacts, priority: 1}
          - {table: pageviews, priority: 2}
          - {table: form_submits, priority: 3}

      - name: all_emails
        array_elements: 5
        source_columns:
          - {table: contacts, order: last, order_by: time, priority: 1}
          - {table: pageviews, order: first, order_by: time, priority: 2}
          - {table: form_submits, order: first, order_by: time, priority: 2}
Option Name Required? Value Description
name: Required String Specifies the name of the master_table that will hold the canonical_id defined by canonical_id:.
canonical_id: Required String Specifies the base canonical_id for the master_table. The number of rows in the master_table corresponds to the unique count of the canonical_id. Note that if the merge_by_keys: configuration for the canonical_id does not cover the key combinations across all source tables, an error will occur (since enrichment across all source tables will not be possible).
attributes: Optional (Deep Nests) Additional columns from the source tables, other than the canonical_id, can be specified as attributes: to include them in the master_table. This option is explained in detail below.

Options for attributes:

Option Name Required? Value Description
name: Required String Specifies the attribute name, which will become the column name in the master_table.
array_elements: Optional Array(String) If the specified column contains multiple values for a single canonical_id (e.g., td_client_id used as keys), this option allows the column to store multiple values as an array. Specifies the number of values to include (up to 10). If omitted, the column will contain only a single value.
source_columns: Required (Deep Nests) Since the specified attribute may span multiple tables and records, this option defines which table and record values to prioritize. If the column name in the table differs from the name: specified, use the column: option here to specify the column name.
valid_regexp: Optional Regular Expression Same as the option with the same name in keys:.
invalid_texts: Optional Array(String) Same as the option with the same name in keys:.

Options for source_columns:

Option Name Required? Value Description
table: Required String Name of the table to use as the attribute (database name is not required).
column: Optional String If the column name in the table differs from the name: specified, specify the column name here.
priority: Optional Integer Specifies the priority for selecting values when multiple tables are provided. Tables can share the same priority level. Within the same priority level, the order: option determines the order of records in the table.
order: Optional String Specifies the sorting order for records based on the column defined in order_by: (default: "time"). Use :first for ascending (ASC) and :last for descending (DESC). Default is :last.
order_by: Optional String Used with the order: option to specify the column for sorting. Default is "time".

Notes on Setting the Same priority:

If multiple tables are assigned the same priority:, they must share the same sorting order. If different sorting orders are specified, an error will occur. For tables with the same priority level, the specified sorting is uniformly applied, and values are selected sequentially from top to bottom.