Presto Query FAQs

Review one or more of the following FAQs to find the answers to your questions about Presto queries.

How Reliable is Presto

Presto transfers data through the network. Due to connection timeouts or problems in worker nodes, this network data transfer may occasionally fail (PAGE_TRANSPORT_TIMEOUT, WORKER_RESTARTED). Presto is designed for faster query processing than Hive, so it sacrifices fault tolerance to some extent. Typically, more than 99.5% of Presto queries finish without any error on the first run. In addition, Treasure Data provides a query retry mechanism on query failures, so nearly 100% of queries finish successfully after being retried.

Because of the nature of network-based distributed query processing, if your query tries to process billions of records, the chance of hitting network failures increases. If you see the PAGE_TRANSPORT_TIMEOUT error frequently, try to reduce the input data size by narrowing down the TD_TIME_RANGE or reducing the number of columns in SELECT statements.

How can I Migrate Hive Queries into Presto

Review the migration guide from Hive. Presto follows ANSI SQL syntax and semantics, while Hive uses a different SQL dialect. The differences include:

  • Quotations:
    In Presto, use double quotation marks for column and table names (e.g., "column1") and single quotation marks for strings (e.g., 'hello presto'); see the example below.
  • Implicit type conversion:
    • In Presto, SELECT CAST(5 AS DOUBLE) / 2 gives 2.5, but SELECT 5 / 2 gives 2.
    • In Hive, both return 2.5.
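
For example, the following Presto query uses both quoting styles (a minimal illustration; the table and column names are hypothetical):

SELECT "column1" FROM "my_table"
WHERE "column1" = 'hello presto'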

Why is My Query not Working

Presto can process millions of rows in a second. If you see query errors or unexpected results, try to minimize your input data set. Here are some options to fix your query:

  • Narrow down the input data size by using TD_TIME_RANGE
  • Create a sample table by using CREATE TABLE AS SELECT … WHERE ….
  • Minimize your query:
    • Remove irrelevant columns from the SELECT statements
    • Reduce the number of joins
    • Replace SELECT c1, c2, … with SELECT count(*)
      • If you are still experiencing query errors, skipping the query result generation with count(*) can reduce your investigation time.
  • Extract sub-queries and check their results individually.
    • Running individual subqueries can help identify the performance bottleneck.
  • Check the conditions in the WHERE clause carefully.
    • Checking a NULL value using the equal (=) operator never matches any row. Instead, use IS NULL (or IS NOT NULL); see the example below.
    • The following query scans the entire data set even if TD_TIME_RANGE is used:
SELECT * FROM table1
WHERE
  col1 < 100
  OR col2 is TRUE
  AND TD_TIME_RANGE(time, '2015-11-01')

The full data set scan occurs because AND has higher precedence than OR. This condition is equivalent to:

(col1 < 100) OR (col2 is TRUE AND TD_TIME_RANGE(time, '2015-11-01'))

The first condition does not specify any TD_TIME_RANGE, so it results in scanning the whole table. To fix this, use parentheses appropriately:

(col1 < 100 OR col2 is TRUE) AND TD_TIME_RANGE(time, '2015-11-01')
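
As noted in the NULL bullet above, here is a minimal illustration of the NULL comparison pitfall, using the same table:

SELECT count(*) FROM table1 WHERE col1 = NULL      -- never matches any row
SELECT count(*) FROM table1 WHERE col1 IS NULL     -- matches rows where col1 is NULL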

Create a Sample Dataset with VALUES Clause.

To test Presto queries, use the VALUES clause to prepare a sample data set. You don’t need to import a data set. For example:

SELECT * FROM (VALUES (1, 'apple'), (2, 'banana')) as fruit(id, name);

## This query gives the following table:
##  id |  name
##-----+--------
##   1 | apple
##   2 | banana

Ask Our Support Team

For Treasure Data customers, if problems persist, see Support Channels or send an e-mail to support@treasuredata.com with the job IDs of your queries. If you can, include information about the expected results and the meaning of your data set.

Why is My Query Slow

Avoid single-node operations

Presto is a distributed query engine, but some operations must run in a single process. For example:

  • count(distinct x) — checks the number of unique values.
  • UNION — checks the uniqueness of tuples by comparing all column values.
    • In most cases, UNION can be replaced with UNION ALL (no duplicate elimination).
  • ORDER BY — Presto sorts the entire data set on a single node.
    • If sorting a million or more rows with ORDER BY is necessary, you can combine Hive and Presto.

Narrow down the table scan range with TD_TIME_RANGE

Treasure Data is a time-series database and creates 1-hour partitioned buckets. Focusing on the specific time range of the data provides the best query performance.

SELECT ... FROM ...
WHERE TD_TIME_RANGE(time, '2015-10-01 PST', '2015-11-01 PST')

Aggregate a series of LIKE clauses into one regexp_like expression.

Presto's query optimizer is unable to improve queries that use several LIKE clauses, so query execution can be slower than expected. To improve performance, you can substitute a series of LIKE clauses chained with OR with a single regexp_like expression. For example, the following query:

SELECT
  ...
FROM
  access
WHERE
  method LIKE '%GET%' OR
  method LIKE '%POST%' OR
  method LIKE '%PUT%' OR
  method LIKE '%DELETE%'

can be optimized by replacing the 4 LIKE clauses with a single regexp_like function:

SELECT
  ...
FROM
  access
WHERE
  regexp_like(method, 'GET|POST|PUT|DELETE')

Check the 1-hour data distribution

If you have a huge number of rows in a 1-hour partition, processing this partition can be the performance bottleneck. To check the number of rows contained in each partition, run a query similar to the following:

SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH') hour, count(*) cnt
FROM test_js_sdk
WHERE TD_TIME_RANGE(time, '2015-10-01 UTC', '2015-11-01 UTC')
GROUP BY 1
ORDER BY cnt DESC
LIMIT 100

This query shows the top 100 partitions that contain the highest number of records between 2015-10-01 and 2015-11-01.

How can I Improve Join Performance

Use simple equi-joins

The equi-join concatenates tables by comparing join keys using the equal (=) operator. If this comparison becomes complex, the join processing slows down. For example, suppose you want to join two tables on a date string such as '2015-10-01', but one of the tables only has columns for year, month, and day values. You can write the following query to generate date strings:

SELECT a.date, b.name FROM
left_table a
JOIN right_table b
ON a.date = CAST((b.year * 10000 + b.month * 100 + b.day) as VARCHAR)

This query delays the join processing because the join condition involves several expressions. You can make it faster by pushing this condition down into a subquery to prepare a join key beforehand:

SELECT a.date, b.name FROM
left_table a
JOIN (
  SELECT
    CAST((b.year * 10000 + b.month * 100 + b.day) as VARCHAR) date,  ## generate join key
    name
  FROM right_table
) b
ON a.date = b.date  ## Simple equi-join

In this example, the join keys are the a.date and b.date columns. Comparing two VARCHAR columns is much faster than comparing a VARCHAR with an expression result.

How can I fix the Exceeded max local memory xxGB error

Rewrite your SQL to use less memory

Presto tracks the memory usage of each query. While the available memory varies according to your price plan, in most cases it is possible to rewrite your query to resolve this issue. Here is a list of memory-intensive operations:

  • distinct
  • UNION
  • ORDER BY
  • GROUP BY (of many columns)
  • joins

Avoid using distinct operator

distinct eliminates duplicate rows. For example, the following query returns the unique set of (c1, c2, c3) tuples in your table:

SELECT distinct c1, c2, c3 FROM my_table

This stores the entire set of columns c1, c2, and c3 in the memory of a single worker node to check the uniqueness of the tuples. The amount of required memory increases with the number of columns and their size. Remove distinct from your query, or apply it after reducing the number of input rows with a subquery.
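
For example, here is a sketch of reducing the input rows first (the TD_TIME_RANGE values are illustrative):

SELECT distinct c1, c2, c3 FROM (
  SELECT c1, c2, c3 FROM my_table
  WHERE TD_TIME_RANGE(time, '2015-11-01', '2015-11-02')
)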

Replace count(distinct x) with approx_distinct(x)

Counting the number of distinct users or events is typically done with count(distinct id), but this operation can cause memory issues:

SELECT count(distinct id) FROM my_table

To reduce the memory usage, use an approximate version of count(distinct x):

SELECT approx_distinct(id) FROM my_table

approx_distinct(x) returns an approximation of the true count; returning a value far from the true count happens only with low probability. If an approximate figure is enough to summarize the characteristics of your data set, use this version.

Use UNION ALL instead of UNION.

For the same reason as with distinct, a UNION of SQL queries performs duplicate elimination and requires a substantial amount of memory:

SELECT c1, c2, c3 FROM my_table1
UNION
SELECT c1, c2, c3 FROM my_table2

If you use UNION ALL, you can avoid the duplicate elimination:

SELECT c1, c2, c3 FROM my_table1
UNION ALL
SELECT c1, c2, c3 FROM my_table2

This requires less memory and is faster. If you need to concatenate two or more SQL query results, use UNION ALL.

Avoid ORDER BY

SELECT c1, c2 FROM my_table ORDER BY c1

Presto performs sorting using a single worker node, so the entire data set must fit within the memory limit of a worker, which is usually less than 5GB.

If you are sorting a small number of rows (~ 10,000 rows), using ORDER BY is fine. If you are going to sort a data set with GBs of data, you might need to find an alternative strategy. If sorting the entire data set is necessary, you can combine Hive and Presto:

  1. Store the results of your Presto query by using a CREATE TABLE AS or INSERT INTO query.
  2. Use Hive to sort the data set, as sketched below.
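
For example, a minimal sketch of this two-step approach (my_result is a hypothetical table name):

-- Step 1 (Presto): materialize the unsorted result
DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS SELECT c1, c2 FROM my_table;

-- Step 2 (Hive, run as a separate job): sort the materialized table
SELECT c1, c2 FROM my_result ORDER BY c1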

Reduce the number of GROUP BY columns

SELECT avg(c1), min_by(c2, time), max(c3), count(c4), ...
FROM my_table
GROUP BY c1, c2, c3, c4, ...

If you enumerate many target columns in a GROUP BY clause, storing the set of keys consisting of (c1, c2, c3, c4, …) requires a lot of memory. Reducing the number of columns in the GROUP BY clause reduces memory usage.

Reduce the size of GROUP BY keys with smart_digest()

We provide a smart_digest(key) UDF for Treasure Data customers, which creates smaller hash values to reduce the size of keys.

SELECT smart_digest(path), count(*)
FROM www_access
GROUP BY smart_digest(path)

Join the largest table to smaller tables

The following type of query, which starts with a small table in the join clause, usually causes Presto to hit its memory limit:

SELECT * FROM small_table, large_table
WHERE small_table.id = large_table.id

Presto performs a broadcast join by default: it partitions the left-side table across several worker nodes, then sends an entire copy of the right-side table to each worker node that has a partition. If the right-side table is large and doesn't fit in memory on a worker node, an error occurs.

Reordering the table join to put the largest table first typically solves the issue:

SELECT * FROM large_table, small_table
WHERE large_table.id = small_table.id

This query distributes the left table (large_table), greatly reducing the chance of hitting the memory limit.

Use distributed joins

If your query still doesn’t work, try using a distributed join by adding a magic comment that sets a session property:

DROP TABLE IF EXISTS my_result;
-- set session join_distribution_type = 'PARTITIONED'
CREATE TABLE my_result AS 
SELECT ... FROM large_table l, small_table s WHERE l.id = s.id

The distributed join algorithm partitions both the left and right side tables by using hash values of the join keys as the partitioning key. It works even if the right side table is large, but it can increase the number of network data transfers and is usually slower than a broadcast join.

How Can I Speed up Queries that Produce Massive Results

The following techniques can help speed up queries that produce massive results.

Use CREATE TABLE AS or INSERT INTO

SELECT * FROM my_table

Presto uses JSON text to materialize the results of a plain SELECT like the preceding one. If the result contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result.

Use CREATE TABLE (table) AS SELECT …

DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS SELECT * FROM my_table;

You can parallelize the query result output process by using a CREATE TABLE AS SELECT statement. To clean up the result table beforehand, add a DROP TABLE statement at the top of your query. The result output can be about 5x faster than running SELECT *: Treasure Data Presto skips the JSON output process and directly produces a 1-hour partitioned table.

Use INSERT INTO (table) SELECT …

You can also use INSERT INTO (table) SELECT … to append data to an existing table. It can also improve the query result output performance:

CREATE TABLE IF NOT EXISTS my_result(time bigint);
INSERT INTO my_result SELECT * FROM my_table;

If the subsequent SELECT statement does not produce a time column, INSERT INTO attaches the query execution time as the time column value, so you can find the inserted rows by the query execution time.
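
For example, here is a sketch of locating rows from a particular insert by their attached time value (the column c1 and the time range are hypothetical):

CREATE TABLE IF NOT EXISTS my_result(time bigint, c1 varchar);
INSERT INTO my_result SELECT c1 FROM my_table;
-- All rows inserted above share the same query execution time
SELECT * FROM my_result
WHERE TD_TIME_RANGE(time, '2015-11-01 00:00:00', '2015-11-02 00:00:00')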

Improve Very Large Query Performance

You can parallelize the query result output process by using the CREATE TABLE AS SELECT statement. If you DROP the table before running the query, performance is significantly better: the result output can be about 5x faster than running SELECT *, because Treasure Data Presto skips the JSON output process and directly produces a 1-hour partitioned table.

Without CREATE TABLE AS, Presto uses JSON text to materialize query results. If the result table contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result, so even when the query computation is almost finished, outputting the JSON results takes a long time.

To clean up the result table beforehand:

  1. Add a DROP TABLE statement at the top of your query.
  2. Use CREATE TABLE (table) AS SELECT …

For example, your query might look like the following.

DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS
    SELECT * FROM my_table;

Avoid using SELECT *

SELECT * FROM my_table

Treasure Data uses a column-oriented storage format, so accessing a small set of columns is very fast, but query performance deteriorates as the number of columns in your query increases. Be selective in choosing columns.

SELECT id, name, address FROM my_table

Use result_output_redirect

To improve the performance of queries that produce massive results, Presto jobs use the result_output_redirect session property, which is set to true by default.

Without it, even when the query results are ready, the job waits for the worker node to complete its sequential output operations. With result_output_redirect, Treasure Data Presto redirects the query results to S3 in parallel, thereby improving query performance.
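
Assuming the session property can be set with a magic comment, in the same way as join_distribution_type shown earlier, re-enabling it would look like this sketch:

-- set session result_output_redirect = 'true'
SELECT id, name FROM my_table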

LIMITATION: There are SQL queries where this hint does not improve performance, including:

  • When a query includes ORDER BY
  • When a query is CREATE TABLE or INSERT INTO
  • When a query includes uppercase letters
  • When a query has duplicated column names

How do I Report Daily Ranges into Columns in a Row

Sometimes you may need to report access counts from different date ranges as columns in a single row. For example:

SELECT d1.id, d1.total AS day1_total, d2.total AS day2_total
FROM (
 SELECT id, count(*) total
 FROM my_table
 WHERE TD_TIME_RANGE(time,
        TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'),
        TD_SCHEDULED_TIME(), 'UTC')
 GROUP BY id
) d1
LEFT JOIN (
 SELECT id, count(*) total
 FROM my_table
 WHERE TD_TIME_RANGE(time,
        TD_TIME_ADD(TD_SCHEDULED_TIME(), '-2d'),
        TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), 'UTC')
 GROUP BY id
) d2
ON d1.id = d2.id

The result of this query would be:

id  day1_total  day2_total
1   10          13
2   14          3

However, if your input table is huge, this query becomes inefficient because it involves joins and also scans the same table multiple times.

Solution: Create a sparse table.

A more efficient approach is to create a sparse table in a single table scan:

SELECT id,
       CASE diff WHEN 0 THEN 1 ELSE 0 END AS day1,
       CASE diff WHEN 1 THEN 1 ELSE 0 END AS day2
FROM (
  SELECT id,
  date_diff('day',
            date_trunc('day', from_unixtime(time)),
            date_trunc('day', from_unixtime(TD_SCHEDULED_TIME()))) AS diff
  FROM my_table
  WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-2d'), TD_SCHEDULED_TIME(), 'UTC')
)

This query produces a sparse table:

id  day1  day2
1   1     0
2   1     0
1   1     0
1   0     1
2   0     1
2   0     1

Then, aggregate the result:

SELECT id,
       sum(day1) AS day1_total,
       sum(day2) AS day2_total
FROM sparse_table
GROUP BY id

For readability, you can combine these steps into a single job by using a WITH statement:

WITH
-- Compute the date differences of the events from the TD_SCHEDULED_TIME
date_diff_table AS (
  SELECT id,
  date_diff('day',
            date_trunc('day', from_unixtime(time)),
            date_trunc('day', from_unixtime(TD_SCHEDULED_TIME()))) AS diff
  FROM my_table
  WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-2d'), TD_SCHEDULED_TIME(), 'UTC')
),
-- Create a sparse table, which maps daily event counts into day1, day2, ...
sparse_table AS (
  SELECT id,
         CASE diff WHEN 0 THEN 1 ELSE 0 END AS day1,
         CASE diff WHEN 1 THEN 1 ELSE 0 END AS day2
  FROM date_diff_table
)
-- Aggregate the sparse table
SELECT id,
       sum(day1) AS day1_total,
       sum(day2) AS day2_total
FROM sparse_table
GROUP BY id

How can I Clean up a Table before Updating

Use multiple statements

Presto supports executing multiple statements, separated by a semicolon (;):

DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table AS SELECT ...;

Multiple statements are not executed within a single transaction. See the limitations on multiple updates.

How can I Simplify a Query

Use the With Statement

If your query becomes complex or deeply nested, try extracting subqueries with a WITH clause. For example, the following query with a nested subquery:

SELECT a, b, c FROM (
   SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl GROUP BY a
) tbl_alias

can be rewritten as follows:

WITH tbl_alias AS (SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl GROUP BY a)
SELECT a, b, c FROM tbl_alias

You can also enumerate multiple subqueries in a WITH clause by separating them with commas:

WITH tbl1 AS (SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl GROUP BY a),
     tbl2 AS (SELECT a, AVG(d) AS d FROM another_tbl GROUP BY a)
SELECT tbl1.*, tbl2.* FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a

Use a WITH statement in a CREATE TABLE statement.

If your CREATE TABLE query becomes complex or deeply nested, try extracting subqueries with a WITH clause. For example, a subquery can be rewritten as follows:

CREATE TABLE tbl_new AS WITH tbl_alias AS (SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl1)
SELECT a, b, c FROM tbl_alias

You can also enumerate multiple subqueries in a WITH clause, as in the following:

CREATE TABLE tbl_new AS WITH tbl_alias1 AS (SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl1),
                             tbl_alias2 AS (SELECT a, AVG(d) AS d FROM tbl2)
SELECT tbl_alias1.*, tbl_alias2.* FROM tbl_alias1 JOIN tbl_alias2 ON tbl_alias1.a = tbl_alias2.a

Specify GROUP BY targets with numbers.

The GROUP BY clause requires repeating the same expression that appears in the SELECT statement:

SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH', 'PDT') hour, count(*) cnt
FROM my_table
GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd HH', 'PDT')  ## <-- redundant expression

You can simplify this query by using GROUP BY 1, 2, …:

SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH', 'PDT') hour, count(*) cnt
FROM my_table
GROUP BY 1

These numbers correspond to the column indexes (1-based) of the SELECT statement.

How can I use JSON

Use json_extract function to parse JSON data

Presto has several functions for reading JSON strings. To extract a scalar JSON parameter, you can use the json_extract_scalar function:

-- This returns 1 (bigint)
SELECT CAST(json_extract_scalar('{"id":1, "name":"Ann"}', '$.id') AS BIGINT) id

Use json_parse + CAST to parse JSON array data

For parsing nested JSON data, you can combine the json_parse and CAST functions:

SELECT * FROM (
  SELECT CAST(json_parse('[1, 2, 3]') AS ARRAY<BIGINT>) items
) CROSS JOIN UNNEST(items) AS t(item);     

This produces:

items      item
[1, 2, 3]  1
[1, 2, 3]  2
[1, 2, 3]  3

You can also parse an array of JSON objects:

WITH records AS (
  SELECT CAST(json_parse('[{"id":1,"name":"Ann"},{"id":2, "name":"Bob"}]') 
    AS ARRAY<JSON>) records
)
SELECT record FROM records CROSS JOIN UNNEST(records) AS t(record);

This query returns each JSON object in a separate row:

record
{"id":1,"name":"Ann"}
{"id":2,"name":"Bob"}

Then you can use json_extract_scalar to read the parameters of each JSON object:

WITH records AS (
  SELECT CAST(json_parse('[{"id":1,"name":"Ann"},{"id":2, "name":"Bob"}]') AS ARRAY<JSON>) records
)
SELECT
   CAST(json_extract_scalar(record, '$.id') AS BIGINT) id,
   CAST(json_extract_scalar(record, '$.name') AS VARCHAR) name
FROM records CROSS JOIN UNNEST(records) AS t(record);
id  name
1   Ann
2   Bob

How can I Specify JSON keys with White Spaces or Special Symbols

Use JSON paths with square brackets and double quotation marks.

SELECT json_extract('{"a b":1}', '$["a b"]');

This returns 1.

How can I use a Binary Function

Presto supports binary functions. To use binary function results on Treasure Data, you need to convert BINARY/VARBINARY values to the VARCHAR type.

Convert to SHA256

SELECT to_hex(sha256(to_utf8('support@treasure-data.com'))) as email
=> "D1F9E365DED3996161145F1FE41A1BE56365E314795DFB07F7B930A6BFF7248E"

Convert from/to base64

SELECT to_base64(to_utf8('support@treasure-data.com')) as email
=> "c3VwcG9ydEB0cmVhc3VyZS1kYXRhLmNvbQ=="
SELECT FROM_UTF8(from_base64('c3VwcG9ydEB0cmVhc3VyZS1kYXRhLmNvbQ=='))
=> "support@treasure-data.com"

How do I use Presto Geospatial Functions

Store Geospatial Data

The building block of geospatial functions is the Geometry data type, which is expressed using the WKT format. Well-known text (WKT) is a text markup language for representing vector geometry objects on a map. Examples include:

POINT (0 0)

LINESTRING (0 0, 1 1, 1 2)

POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1))

MULTIPOINT (0 0, 1 2)

MULTILINESTRING ((0 0, 1 1, 1 2), (2 3, 3 2, 5 4))

MULTIPOLYGON (((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1)), ((-1 -1, -1 -2, -2 -2, -2 -1, -1 -1)))

GEOMETRYCOLLECTION (POINT(2 3), LINESTRING (2 3, 3 4))

As WKT is a text format, Geometry data types are stored as VARCHAR or STRING types (excluding the Point data type, which can be stored as DOUBLE values representing the latitude and longitude). For example, to store a polygon in a table, you store the WKT representation of the polygon; the actual text stored in the TD table would be 'POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1))'. If you needed to store a multipoint, the actual text stored in the TD table would be 'MULTIPOINT (0 0, 1 2)'.
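
For example, a minimal sketch of storing WKT text in a table (the table and column names are hypothetical):

CREATE TABLE IF NOT EXISTS areas (name varchar, area varchar);
INSERT INTO areas VALUES ('unit_square', 'POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))')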

Convert String Data to Geometry Type

All Presto geospatial functions require the Geometry data type, but the data is stored in TD as WKT-formatted strings. As a result, the WKT string data has to be converted into a Geometry type before it can be used in geospatial functions. This conversion is done with geospatial constructor functions, which accept WKT-formatted text as arguments and return the Geometry data type required by geospatial functions. Examples of constructor functions include:

  • ST_Polygon(varchar) → Polygon. Example: ST_Polygon('POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1))') accepts the WKT text representation and returns a polygon Geometry data type.
  • ST_GeometryFromText(varchar) → Geometry. Example: ST_GeometryFromText('MULTILINESTRING ((0 0, 1 1, 1 2), (2 3, 3 2, 5 4))') accepts the WKT text representation and returns a multilinestring Geometry data type.

Run Geospatial Queries in Presto

Let's assume you have a database with two tables: cities and trips. The cities table has columns city and area: city is a string (e.g., 'Dallas'), and area is a WKT string representation of the city area as a polygon, for example 'POLYGON ((0.3528644 32.5869589, 0.3587434 32.5963149, 0.3353124 32.6015929, 0.3528644 32.5869589))'. The trips table is a list of bus rides taken, with columns username, a string of a user's name (e.g., 'Peter'), and journey, a WKT string representation of a linestring for each trip taken, for example 'LINESTRING (0 0, 1 1, 1 2)'.

Given that, let us assume that you wanted to find how many trips passed through certain cities. The geospatial function to find that is:

ST_Intersects(Geometry, Geometry) → boolean

An example query is:

SELECT c.city AS City, count(*) as NumTrips
FROM trips as t
JOIN cities as c
on  ST_Intersects(ST_Polygon(c.area), ST_LineFromText(t.journey))
GROUP BY 1

The output would be:

City      NumTrips
Ames      10
New York  11
Dallas    3
Vegas     44

Calculate the Distance in Kilometers Between two Points using Latitude and Longitude.

For Presto, ST_Distance returns the distance in the units of the spatial reference system (degrees for latitude/longitude data), which can be impractical in use cases where you need the distance between two points in kilometers. In that case, you can compute the distance with an SQL query. Assuming two locations A(lat1, long1) and B(lat2, long2), the following query finds the distance between the two points in kilometers:

SELECT 6371 * ACOS(COS(RADIANS(lat1)) * COS(RADIANS(lat2))
  * COS(RADIANS(long2) - RADIANS(long1))
  + SIN(RADIANS(lat1)) * SIN(RADIANS(lat2)));

Note: 6371 is the approximate mean radius of the Earth in kilometers.

The result would be the distance between the two points in kilometers. For example: 40.45

How do I Access TD table Metadata using Presto

Treasure Data provides the information_schema database as a preset, which contains tables for accessing table metadata.

You can query this internal database directly:

## List TD Databases
SELECT * from information_schema.schemata

## List TD Tables
SELECT * from information_schema.tables

## List all column metadata
SELECT * from information_schema.columns

Optionally, you may use Presto SHOW commands to see similar information:

## List Databases
SHOW SCHEMAS

## List Tables
SHOW TABLES FROM <database_name>

## List column metadata for a specific table
SHOW COLUMNS FROM <database_name>.<table_name>

How can I fix the PAGE TRANSPORT TIMEOUT Error

Presto transfers the data through the network. Because of connection timeouts or problems in worker nodes, this network data transfer may occasionally fail with the PAGE_TRANSPORT_TIMEOUT error.

Presto is designed for faster query processing when compared to Hive, so it sacrifices fault-tolerance, to some extent. Typically, more than 99.5% of Presto queries finish without any error on the first run.

Treasure Data provides a query retry mechanism on query failures, so nearly 100% of queries finish successfully after being retried.

Because of the nature of network-based distributed query processing, if your query tries to process billions of records, the chance of hitting network failures increases. If you start seeing the PAGE_TRANSPORT_TIMEOUT error frequently, try to reduce the input data size by narrowing down the TD_TIME_RANGE or reducing the number of columns in SELECT statements.

How can I Migrate from TD_SESSIONIZE to TD_SESSIONIZE_WINDOW

The TD_SESSIONIZE() function has been deprecated because of performance issues and sometimes-inconsistent results and will be removed in a future release of Treasure Data Presto.

The UDF TD_SESSIONIZE_WINDOW() was introduced in 2016 to replace TD_SESSIONIZE. It is a Presto window function with equivalent functionality, more consistent results, and faster, more reliable performance.

Code that uses TD_SESSIONIZE() should be rewritten as soon as possible to use TD_SESSIONIZE_WINDOW().

This section describes how to rewrite a query using TD_SESSIONIZE() to use TD_SESSIONIZE_WINDOW().

Sessionization of a table of event data groups a series of event rows associated with users into individual sessions for analysis. As long as the series of events is associated with the same user identifier (typically IP address, email, cookie or similar identifier) and events are separated by no more than a chosen timeout interval, those events can be grouped into a session. Each session is then assigned a unique session ID.

TD_SESSIONIZE and TD_SESSIONIZE_WINDOW both read event data, partition events into sessions associated with users, and assign unique UUIDs as session IDs to the rows.

However, TD_SESSIONIZE_WINDOW is implemented as a Presto window function, so in calling it you use an OVER clause to specify the window over which events are aggregated, rather than passing that into the function directly as with TD_SESSIONIZE.

Comparing TD_SESSIONIZE and TD_SESSIONIZE_WINDOW

The original TD_SESSIONIZE() took three arguments:

string TD_SESSIONIZE(int/long unix_timestamp, int timeout, string sessionize_by)

The following example query sessionizes page view data from the table web_logs, with path as the page visited, using:

  • column web_logs.time for the timestamp
  • 3600 seconds (one hour) as the maximum session timeout
  • web_logs.ip_address as the user identifier:
SELECT TD_SESSIONIZE(time, 3600, ip_address) as session_id,
       time,
       ip_address,
       path
FROM (
  SELECT time, ip_address, path
  FROM web_logs
  ORDER BY ip_address, time
) t

Note the subquery, which sorts the rows by ip_address and time: TD_SESSIONIZE required that the rows to be sessionized were pre-sorted on the sessionize_by field and the timestamp.

The new UDF TD_SESSIONIZE_WINDOW takes two arguments:

string TD_SESSIONIZE_WINDOW(int/long unix_timestamp, int timeout)

  • The time argument is specified as unix epoch time
  • The timeout argument is specified in seconds

As a Presto window function, however, it is used in a SELECT statement with an OVER clause, supplying:

  • A PARTITION BY clause to partition the input rows on the user identifier field
  • An ORDER BY clause to sort the rows by the time column

Like TD_SESSIONIZE, TD_SESSIONIZE_WINDOW generates a unique session ID (UUID) for the first row in a session, and returns that value for all rows in the session, until it finds a separation in time between rows greater than the timeout value.

To rewrite a query that uses TD_SESSIONIZE to use TD_SESSIONIZE_WINDOW:

  • For the PARTITION BY key of TD_SESSIONIZE_WINDOW, use the third argument to TD_SESSIONIZE
  • For the ORDER BY key of TD_SESSIONIZE_WINDOW, use the time column

The equivalent to the previous sample query is:

SELECT TD_SESSIONIZE_WINDOW(time, 3600)
         OVER (PARTITION BY ip_address ORDER BY time) as session_id,
       time,
       ip_address,
       path
FROM web_logs;

Here is another example based on an actual customer query:

SELECT time,
       td_sessionize(time, 10800, user_id) as session_id,
       company_id
FROM (
  SELECT time,
         CAST(user_id AS varchar) as user_id,
         company_id
  FROM access_company
  WHERE user_id IS NOT NULL
  ORDER BY CAST(user_id AS varchar), time
);

When rewritten the query becomes:

-- Rewrite
CREATE TABLE session_table AS
SELECT time,
       td_sessionize_window(time, 10800)
         OVER (PARTITION BY user_id ORDER BY time) as session_id,
       company_id
FROM (
  SELECT time,
         CAST(user_id AS varchar) as user_id,
         company_id
  FROM access_company
  WHERE user_id IS NOT NULL
);

How can I compute a median in Presto

Presto does not provide a median function, but you can use the approx_percentile function with a 0.5 percentile to return the approximate middle value of the sorted list:

SELECT approx_percentile(price, 0.5) FROM nasdaq

How can I concatenate strings

SELECT 'hello ' || 'presto'

This returns the string 'hello presto'.
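
Equivalently, you can use the concat function:

SELECT concat('hello ', 'presto')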

How can I supply a default value if the column contains NULLs

COALESCE(v1, v2, …) returns the first non-null value from v1, v2, …:

-- This returns 'N/A' if name value is null
SELECT COALESCE(name, 'N/A') FROM table1

How can I take only a max or min value of two values

SELECT greatest(5, 10) -- returns 10
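
Similarly, least returns the smaller of the values:

SELECT least(5, 10) -- returns 5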