ID Unification Tips
Tips: Visualizing a Graph Table Using Graphviz
Step 1: Query and Export Graph Table Data as CSV
Run the following SQL query on the graph_unify_loop_${N}
(or graph
) table for each iteration, and download the resulting data as a CSV file.
SELECT
CONCAT('"',leader_id,'"') AS leader_id,
CONCAT('"',follower_id,'"') AS follower_id,
'' AS node_label,
'' AS node_shape,
'' AS edge_label
FROM ${canonical_id_name}_graph_unify_loop_${N};
Example output:
leader_id | follower_id | node_label | node_shape | edge_label |
---|---|---|---|---|
"""aaa_001""" | """3rd_008""" | |||
"""aaa_001""" | """3rd_013""" | |||
"""aaa_001""" | """3rd_005""" | |||
"""aaa_001""" | """xxx_004""" |
Step 2: Edit CSV File
Open the CSV file in a text editor and replace all instances of '""'
with ''
. This will clean up unnecessary quotes around the values.
Edited example:
leader_id | follower_id | node_label | node_shape | edge_label |
---|---|---|---|---|
"aaa_001" | "3rd_008" | |||
"aaa_001" | "3rd_013" | |||
"aaa_001" | "3rd_005" | |||
"aaa_001" | "xxx_004" |
Step 3: Convert CSV to DOT Language Graph
Ex. Use the 3rd party service - https://docpose.com/csv-to-dot
Tips: Expanding and Visualizing Histogram Columns from result_key_stats
(Example: 100 Rows)
Overview
In this guide, we process and visualize histogram columns such as histogram_${key_name}
from the result_key_stats
table. Using SQL, we transform these columns into a structured format and plot histograms and pie charts for insights.
Sample result_key_stats
Table
from_table | total_distinct | distinct_with_email | distinct_with_td_client_id | distinct_with_td_global_id | distinct_with_td_ssc_id | histogram_email | histogram_td_client_id | histogram_td_global_id | histogram_td_ssc_id | time |
---|---|---|---|---|---|---|---|---|---|---|
* | 1909226 | 30579 | 1909096 | 1908998 | 680373 | 1:30247,2:311,3:14,4:4,7:1,25:1,31:1 | 1:1738227,2:115270,3:26894,4:10851,5:5474,... | 1:1575567,2:166595,3:69711,4:27998,5:16761,... | 1:563326,2:96817,3:12685,4:3563,5:1579,... | 1689326177 |
SQL Query to Transform and Normalize Histogram Data
The following SQL extracts, normalizes, and formats histogram data for plotting. It includes logic to handle missing x-values and calculate percentages for each bin.
WITH tbl_histogram AS
(
SELECT
CAST(SPLIT_PART(elm, ':', 1) AS BIGINT) AS x,
CAST(SPLIT_PART(elm, ':', 2) AS BIGINT) AS num
FROM (
SELECT
SPLIT(histogram_${key_name}, ',') AS hist_ary,
ROW_NUMBER() OVER (ORDER BY time DESC) AS id
FROM ${canonical_id_name}_result_key_stats
WHERE from_table = '*'
)
CROSS JOIN UNNEST(hist_ary) AS t(elm)
WHERE id = 1
),
tbl_serial_numbers AS
(
SELECT val
FROM (VALUES (SEQUENCE(1, 100, 1))) AS t(seq_ary)
CROSS JOIN UNNEST(seq_ary) AS t(val)
)
SELECT x, num, num * 1.0 / SUM(num) OVER (ORDER BY x RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS percentage_of_total
FROM
(
SELECT x, num
FROM tbl_histogram
UNION ALL
SELECT x, 0 AS num
FROM (SELECT val AS x FROM tbl_serial_numbers)
WHERE x NOT IN (SELECT x FROM tbl_histogram)
)
ORDER BY x
LIMIT 100;
Key Components of the Query
-
Percentage Calculation
percentage_of_total
computes the proportion of eachx
relative to the total. -
Histogram Bin Range
SEQUENCE(1, 100, 1)
ensures the x-axis ranges from 1 to 100, even if certain bins are missing. -
Row Limit
LIMIT 100
restricts the output to the first 100 bins.
Example Visualizations
1. distinct_with_email
Histogram
Output Table
x | num | percentage_of_total |
---|---|---|
1 | 30337 | 0.99 |
2 | 313 | 0.01 |
3 | 14 | 0.00 |
4 | 4 | 0.00 |
5 | 0 | 0.00 |
... | ... | ... |
Observations
-
Most
canonical_id
s are mapped to only oneemail
. -
Small counts exist for multiple
emails
percanonical_id
.
Bar Graph
Pie Chart
2. distinct_with_td_client_id
Histogram
Output Table
x | num | percentage_of_total |
---|---|---|
1 | 1690899 | 0.90 |
2 | 121851 | 0.07 |
3 | 28552 | 0.02 |
4 | 11489 | 0.01 |
5 | 5783 | 0.00 |
... | ... | ... |
Observations
-
Majority of
canonical_id
s correspond to a singletd_client_id
. -
Potential reasons:
-
Many users interact briefly, resulting in only one
td_client_id
. -
Incomplete stitching of
td_client_id
s across sessions.
-
Many users interact briefly, resulting in only one
Bar Graph
Pie Chart
Insights and Considerations
-
Stitching Accuracy
Results reflect the accuracy ofcanonical_id
stitching. High counts forx=1
may indicate insufficient linking across identifiers. -
User Behavior Patterns
Distributions help identify how users interact with your platform (e.g., single or multiple identifiers). -
Data Quality
Missing or misaligned data may result in skewed histograms.
By visualizing these patterns, we gain deeper insights into user identity stitching and potential areas for data enhancement.