ID Unification Tips

Tips: Visualizing a Graph Table Using Graphviz

Step 1: Query and Export Graph Table Data as CSV

Run the following SQL query on the graph_unify_loop_${N} (or graph) table for each iteration, and download the resulting data as a CSV file.

Copy
Copied
SELECT
  CONCAT('"',leader_id,'"') AS leader_id,
  CONCAT('"',follower_id,'"') AS follower_id,
  '' AS node_label,
  '' AS node_shape,
  '' AS edge_label
FROM ${canonical_id_name}_graph_unify_loop_${N};

Example output:

leader_id follower_id node_label node_shape edge_label
"""aaa_001""" """3rd_008"""
"""aaa_001""" """3rd_013"""
"""aaa_001""" """3rd_005"""
"""aaa_001""" """xxx_004"""

Step 2: Edit CSV File

Open the CSV file in a text editor and replace all instances of '""' with ''. This will clean up unnecessary quotes around the values.

Edited example:

leader_id follower_id node_label node_shape edge_label
"aaa_001" "3rd_008"
"aaa_001" "3rd_013"
"aaa_001" "3rd_005"
"aaa_001" "xxx_004"

Step 3: Convert CSV to DOT Language Graph

Ex. Use the 3rd party service - https://docpose.com/csv-to-dot

Tips: Expanding and Visualizing Histogram Columns from result_key_stats (Example: 100 Rows)

Overview

In this guide, we process and visualize histogram columns such as histogram_${key_name} from the result_key_stats table. Using SQL, we transform these columns into a structured format and plot histograms and pie charts for insights.

Sample result_key_stats Table

from_table total_distinct distinct_with_email distinct_with_td_client_id distinct_with_td_global_id distinct_with_td_ssc_id histogram_email histogram_td_client_id histogram_td_global_id histogram_td_ssc_id time
* 1909226 30579 1909096 1908998 680373 1:30247,2:311,3:14,4:4,7:1,25:1,31:1 1:1738227,2:115270,3:26894,4:10851,5:5474,... 1:1575567,2:166595,3:69711,4:27998,5:16761,... 1:563326,2:96817,3:12685,4:3563,5:1579,... 1689326177

SQL Query to Transform and Normalize Histogram Data

The following SQL extracts, normalizes, and formats histogram data for plotting. It includes logic to handle missing x-values and calculate percentages for each bin.

Copy
Copied
WITH tbl_histogram AS
(
  SELECT
     CAST(SPLIT_PART(elm, ':', 1) AS BIGINT) AS x,
     CAST(SPLIT_PART(elm, ':', 2) AS BIGINT) AS num
  FROM (
    SELECT
      SPLIT(histogram_${key_name}, ',') AS hist_ary,
      ROW_NUMBER() OVER (ORDER BY time DESC) AS id
    FROM ${canonical_id_name}_result_key_stats
    WHERE from_table = '*'
  )
  CROSS JOIN UNNEST(hist_ary) AS t(elm)
  WHERE id = 1
),
tbl_serial_numbers AS
(
  SELECT val
  FROM (VALUES (SEQUENCE(1, 100, 1))) AS t(seq_ary)
  CROSS JOIN UNNEST(seq_ary) AS t(val)
)
SELECT x, num, num * 1.0 / SUM(num) OVER (ORDER BY x RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS percentage_of_total
FROM
(
  SELECT x, num
  FROM tbl_histogram
  UNION ALL
  SELECT x, 0 AS num
  FROM (SELECT val AS x FROM tbl_serial_numbers)
  WHERE x NOT IN (SELECT x FROM tbl_histogram)
)
ORDER BY x
LIMIT 100;

Key Components of the Query

  1. Percentage Calculation
    percentage_of_total computes the proportion of each x relative to the total.
  2. Histogram Bin Range
    SEQUENCE(1, 100, 1) ensures the x-axis ranges from 1 to 100, even if certain bins are missing.
  3. Row Limit
    LIMIT 100 restricts the output to the first 100 bins.

Example Visualizations

1. distinct_with_email Histogram

Output Table
x num percentage_of_total
1 30337 0.99
2 313 0.01
3 14 0.00
4 4 0.00
5 0 0.00
... ... ...
Observations
  • Most canonical_id s are mapped to only one email .
  • Small counts exist for multiple emails per canonical_id .
Bar Graph

canonical_id_1

Pie Chart

canonical_id_1

2. distinct_with_td_client_id Histogram

Output Table
x num percentage_of_total
1 1690899 0.90
2 121851 0.07
3 28552 0.02
4 11489 0.01
5 5783 0.00
... ... ...
Observations
  • Majority of canonical_id s correspond to a single td_client_id .
  • Potential reasons:
    • Many users interact briefly, resulting in only one td_client_id .
    • Incomplete stitching of td_client_id s across sessions.
Bar Graph

canonical_id_1

Pie Chart

canonical_id_1

Insights and Considerations

  1. Stitching Accuracy
    Results reflect the accuracy of canonical_id stitching. High counts for x=1 may indicate insufficient linking across identifiers.
  2. User Behavior Patterns
    Distributions help identify how users interact with your platform (e.g., single or multiple identifiers).
  3. Data Quality
    Missing or misaligned data may result in skewed histograms.

By visualizing these patterns, we gain deeper insights into user identity stitching and potential areas for data enhancement.