Writing Presto Queries

The following information can help you learn how to write Presto queries.

Presto Options

Consider the following Presto features when creating queries to run in Treasure Data.

Use the WITH Clause for Nested Queries

The WITH clause is useful for nested queries as shown in this example query:

Copy

Copied

SELECT 
  a,
  b,
  c
FROM (
    SELECT 
      a,
      MAX(b) AS b,
      MIN(c) AS c
    FROM
      tbl
    GROUP BY
      a
  ) tbl_alias

The same query can be written with the WITH clause as follows:

Copy

Copied

WITH tbl_alias AS(
  SELECT 
    a,
    MAX(b) AS b,
    MIN(c) AS c
  FROM
    tbl
  GROUP BY
    a
) SELECT 
  a,
  b,
  c
FROM
  tbl_alias

Notice that the query has been “de-nested”.

The following example shows multiple subqueries use WITH. Each subquery is delimited by a comma.

Copy

Copied

WITH tbl1 AS(
  SELECT 
    a,
    MAX(b) AS b,
    MIN(c) AS c
  FROM
    tbl
  GROUP BY
    a
),
tbl2 AS(
  SELECT 
    a,
    AVG(d) AS d
  FROM
    another_tbl
  GROUP BY
    a
) SELECT 
  tbl1. * ,
  tbl2. *
FROM
  tbl1
JOIN
  tbl2
  ON tbl1.a = tbl2.a

Using VALUES for Prototyping

If you want to quickly test Presto syntax, you can use VALUES to create a table immediately.

Copy

Copied

SELECT 
  a,
  b,
  c
FROM (
  VALUES(
    1,
    2.0,
    'x'
  ),
  (
    3,
    4.0,
    'y'
  ),
  (
    5,
    6.0,
    'z'
  )
  ) tbl(
    a,
    b,
    c
  )

The above query returns the following table:

a	b	c
1	2.0	'x'
3	4.0	'y'
5	6.0	'z'

Clean up a Table

To clean up a table before using CREATE TABLE AS or INSERT INTO statements, use multiple statements split by semi-colon.

Copy

Copied

DROP
  TABLE
    IF EXISTS mytable
;

CREATE
  TABLE
    mytable AS SELECT 
      . . .
;

ALTER TABLE - DROP COLUMN

You can include the SQL DDL statement ALTER TABLE...DROP COLUMN SQL in your Treasure Data queries to, for example, deduplicate data.

An example is as follows:

Copy

Copied

DROP
  TABLE
    IF EXISTS task_temp
;

CREATE
  TABLE
    task_temp AS SELECT 
       *
    FROM
      (
        SELECT 
           * ,
          row_number(
          ) over(
            partition BY id
          ORDER BY
            TD_TIME_PARSE(lastmodifieddate) DESC
          ) AS rnk
        FROM
          task_raw
      )
    WHERE
      rnk = 1
      AND isdeleted = 0
    ORDER BY
      lastmodifieddate DESC
;

ALTER TABLE
  task_temp DROP
    COLUMN rnk
;

Limitation in Presto on Multiple Updates

Multiple-statement execution is not guarded by a transaction, therefore never write multiple update operations in a single job. For example, if you write two or more INSERT INTO statements in a single job, it may produce duplicated records:

Copy

Copied

# I1 INSERT
  INTO
    mytable SELECT 
      . . .
;

# I2 INSERT
  INTO
    mytable SELECT 
      . . .
;

If the system finds a sporadic failure during the job execution, it will start the job from the scratch, so the actual execution could be I1 (success), I2 (failure), (retry from the scratch), I1 (success) and I2 (success). In this case, I1 was executed twice and I2 was executed only once. The result is duplicated records in the target table.

Improving Performance of Very Large Presto Queries

You can parallelize the query result output process by using the CREATE TABLE AS SELECT statement. If you DROP the table before running the query, your performance is significantly better. The result output performance will be 5x faster than running SELECT *. Treasure Data Presto skips the JSON output process and directly produces a 1-hour partitioned table.

Without using DROP TABLE, Presto uses JSON text to materialize query results. And if the result table contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result. So, even if the query computation is almost finished, output of theJSON results takes a long time.

To clean up the result table beforehand:

Add a DROP TABLE statement at the top of your query.
Use CREATE TABLE (table) AS SELECT …

For example, your query might look like this:

Copy

Copied

DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS 
    SELECT * FROM my_table;

Supported Time Formats in TD TIME FORMAT UDF

This is a summary of the time formats supported by the TD_TIME_FORMAT_UDF udf.

Formats

Syntax	Date or Time Component	Presentation	Examples
G	Era designator	Text	AD
yyyy	Year	Year	1996
yy	Year	Year (2 digits)	96
MMMM	Month in year	Month long name	July
MMM	Month in year	Month short name	Jul
MM, M	Month in year	Month number	07
ww, w	Week in year	Number	06
DDD, DD, D	Day in year	Number	189
dd, d	Day in month	Number	10
EEEE	Day in week	Text	Tuesday
E, EEE	Day in week	Text (short form)	Tue
a	Am/pm marker	Text	PM
HH, H	Hour in day (0-23)	Number	00
kk, k	Hour in day (1-24)	Number	24
KK, K	Hour in AM/PM (0-11)	Number	00
hh, h	Hour in AM/PM (1-12)	Number	12
mm, m	Minute in hour	Number	30
ss, s	Second in minute	Number	55
SSS, SS, S	Millisecond	Number	978
wyear	Week of the year	Number	TD_TIME_FORMAT(1582859260, 'wyear' TD_TIME_FORMAT(1582859260, 'wyear', 'UTC'))
zzzz	Time zone	Zone long name	Pacific Standard Time, or GMT+01:00
z	Time zone	Zone short name	PST, or GMT+01:00
Z	Time zone	Zone offset	-0800
u	Day number of week (1-7)	Number	1 (for Monday)

Example

This example formats a UNIX timestamp into a date formatted string. You can use the sample_datasets database for example to try it out:

SQL

Copy

Copied

SELECT 
  time,
  TD_TIME_FORMAT(time,
    'yyyy-MM-dd HH:mm:ss z')
FROM
  www_access LIMIT 1
;

Result

Copy

Copied

2012-01-01 00: 00: 00 UTC

SQL DDL Command Differences in Hive and Presto

Data Definition Language (DDL) is a concept in SQL that can be used to create and modify the structure of database objects or schema objects. You can use DDL statements to create, manage, and delete tables, indexes, users, and other schema objects.

As it pertains to data-processing, Presto is optimized for quickly processing smaller volumes of information while Hive is better for querying more expansive databases requiring large amounts of memory. Both Hive and Presto use DDL commands but in slightly different ways.

The following table illustrates DDL commands required for Hive and Presto.

SQL DDL Commands and Their Equivalents

SQL DDL Commands	Hive SQL DDL	Presto SQL DDL	Query Definition
ALTER	ALTER	ALTER	Modifies the structure of an existing table.
CREATE	CREATE	CREATE	Creates a database or objects including tables, views, stored procedures, etc.
DROP	DROP	DROP	Deletes an existing database or an object within a database.
TRUNCATE	TRUNCATE	DELETE	Removes all the records from a table.

Presto Query Signatures

You can view query signatures from the Presto Utilization dashboard and use the query signatures to identify and group similar queries.

The Presto Utilization dashboard is available upon request for most customers.

Presto query signatures are clusters of Presto queries share common SQL statements. Similar jobs, jobs with only a few differences (such as the date range computed), are grouped together. You can then analyze the clusters of queries and determine which queries require the most or least resources in a given month.

The Presto Query Details table, which is posted on the Presto Utilization dashboard, includes a table column: Query Names or Signature.

Query Signature Definitions for Common Letters
Example of Query Signatures

Query Signature Definitions for Common Letters

Query Signatures are subject to change. The following list defines common query letters that help to decipher the query that the query signature represents:

Query Letter	Description
S	Select `\\|S\[\* \]` means SELECT `\*` is used (retrieving all columns. This argument might be inefficient because our storage is columnar format.)
Cnt	count(x)
CntD	count(distinct x) (If query has this signature, it means a single node process = slow!)
T	table
G	Group by
J	Join
O	Order by (single node process = slow!)
L	Limit
CT	Create table
I	Insert Into
DT	Delete table
D	Delete
E	Duplicate Elimination (select distinct, union, etc.)
U	Union All
E(U(…))	Union (with duplicate elimination)

Example of Query Signatures

The following example shows SQL statements mapped to query signatures:

Query Signature	Results
SELECT 1	`S(V)`
SELECT * FROM t	`S * t - > #`
INSERT INTO A SELECT * FROM B	`B I(` `S *` `) B - > A`
SELECT * FROM t, ‘query’	`S[*](T)`
SELECT * FROM t, ‘table’	`t - > #`

Presto Queries

What follows is not a complete set of SQL statements. These are just some basic SQL functions that you might use in your queries.

Info

Before you begin writing your queries, review Presto Query Runtime Limits and Query Hint Override.

Select Data

You can use the SELECT statement to select data from a database. The data returned is stored in a results table, called the result-set.

SELECT … FROM … WHERE … LIMIT

Example

Copy

Copied

SELECT contact_id, last_name, first_name
FROM contacts
WHERE website = "Company_Organization.com"
ORDER BY contact_id DESC
LIMIT 5;

Aggregate Data

Use aggregate functions to perform a calculation on one or more values and return a single value.

GROUP BY

Use the GROUP BY command to group the result set (used with aggregate functions: COUNT, MAX, MIN, SUM, AVG).

Example

Copy

Copied

SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country;

MIN()/MAX()

The MIN() function returns the smallest value of the selected column.

Example

Copy

Copied

SELECT MIN(columnname)
FROM table_name
WHERE condition;

The MAX() function returns the largest value of the selected column.

Example

Copy

Copied

SELECT MAX(column_name)
FROM table_name
WHERE condition;

MIN_BY/MAX_BY

MIN_BY takes two arguments and returns the value of the first argument for which the value of the second argument is minimized.

Copy

Copied

SELECT MIN_BY(close_date, close_value) as date_of_min_sale
FROM sales_pipeline

MAX_BY takes two arguments and returns the value of the first argument for which the value of the second argument is maximized.

APPROX_DISTINCT

APPROX_DISTINCT calculates an approximate count of the number of distinct values.

Example

Copy

Copied

SELECT APPROX_DISTINCT(close_value)
  FROM sales_pipeline

APPROX_PERCENTILE

APPROX_PRECENTILE calculates the value at a given percentile of a distribution of values.

Example

Copy

Copied

SELECT APPROX_PERCENTILE(close_value, 0.75)
FROM sales_pipeline

Work with Time Ranges

Create and Update Tables from SQL Results

The following SQL statements can help you create and update tables in Presto.

CREATE TABLE … AS

Create a new table containing the result of a select query. Use CREATE TABLE to create an empty table. You might want to use these optional clauses:

IF NOT EXISTS clause causes the error to be suppressed if the table already exists.
WITH clause can be used to set properties on the newly created table.

Example

Copy

Copied

CREATE TABLE [ IF NOT EXISTS ] table_name [ ( column_alias, ... ) ]
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]

INSERT INTO

Insert data into an existing table. The following example uses person as the name of the table.

Example

Copy

Copied

insert into person values (1, 'james', 10, '20190301');

Delete Records

Delete rows from a table. If you specify a WHERE clause, only the matching rows are deleted. Otherwise, all rows from the table are deleted.

The WHERE clause also helps reduce unintentional DELETE commands.

DELETE statements are helpful for:

Cleaning up error records -Removing duplicate records
Removing unnecessary or unexpected results from result tables

Limitations

Review Delete Statement Syntax and known limitations for the Presto Delete Statement.

Syntax

Copy

Copied

DELETE FROM <table_name> [ WHERE <condition> ]

Where:

<table _ name> is the name of the table for deletion
is the clause to specify a search condition

Examples Delete all products shipped by boat.

Copy

Copied

DELETE FROM products WHERE shipmode = 'BOAT';

Delete all products for low priority orders.

Copy

Copied

DELETE FROM products
WHERE orderkey IN (SELECT orderkey FROM orders WHERE priority = 'LOW');

Delete all products.

Copy

Copied

DELETE FROM products;

Perform Window Functions

Window functions perform calculations across rows of the query result. They run after the HAVING clause but before the ORDER BY clause. Invoking a window function requires special syntax using the OVER clause to specify the window. A window has these three components:

Partition Specification . Separates the input rows into different partitions. This is analogous to how the GROUP BY clause separates rows into different groups for aggregate functions.
Ordering Specification . Determines the order in which input rows are processed by the window function.
Window Frame . Specifies a sliding window of rows to be processed by the function for a given row. If the frame is not specified, it defaults to RANGE UNBOUNDED PRECEDING</code>, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW . This frame contains all rows from the start of the partition up to the last peer of the current row.

For example, the following query ranks orders for each clerk by price.

Example

Copy

Copied

SELECT orderkey, clerk, totalprice,
       rank() OVER (PARTITION BY clerk
                    ORDER BY totalprice DESC) AS rnk
FROM orders
ORDER BY clerk, rnk

Access Array Type Data

ARRAY_AGG

You can use the aggregate function, array_agg, with ARRAY data types that allow arrays to be created as part of the SELECT list of a query.

Example

The following target table contains one row per employee and stores all phone numbers associated with an employee in an ARRAY type.

Copy

Copied

 CREATE SET TABLE employeePhoneInfo
      (emp_id INTEGER, 
       emp_name VARCHAR(30), 
       emp_phone emp_phone_array);
 
   INSERT INTO employeePhoneInfo 
      SELECT emp_id, emp_name, 
         ARRAY_AGG(emp_phone, NEW emp_phone_array()) 
      FROM employee GROUP BY emp_id,emp_name 
         WHERE emp_id < 100;

AWS Lambda

AWS Lambda is a compute service that lets you run code without provisioning or managing servers.

Example This example creates an array (fruits). The has_elements flag identifies whether or not more elements need to be processed.

Copy

Copied

exports.handler = (event, context, callback) => 
    var fruits = ['apple', 'orange', 'pinaple'];
    
    event.fruits = fruits;
    
    if (event.fruits.length > 0) 
        event.has_elements = true;
     else 
        event.has_elements = false;
    
    
    callback(null, event);
;

This example processes the array elements. We want to remove processed elements from the array. The .shift() method removes the first element from the array.

Copy

Copied

exports.handler = (event, context, callback) => 
    let fruit = event.fruits.shift();
    // Process array element
    console.log('Processing fruit: ' + fruit);
    
    // Array still has elements:
    if (event.fruits.length > 0) 
        event.has_elements = true;
     else 
        event.has_elements = false;
    
    
    // Log array elements (for demonstration purpose)
    console.log('Elements in array: ' + event.fruits);
    console.log('Array has more elements: ' + event.has_elements);
    
    callback(null, event);
;

Process Data

Data Type Conversion Functions

Presto implicity converts numeric and character values to the correct type if such a conversion is possible. Presto does not convert between character and numeric types. For example, a query that expects a varchar does not automatically convert a bigint value to an equivalent varchar.

When necessary, values can be explicitly cast to a particular type.

Conversion Functions

cast</code>(value AS type) → type

Explicitly cast a value as a type. This can be used to cast a varchar to a numeric value type and vice versa.

try_cast</code>(value AS type) → type

Like cast(), but returns null if the cast fails.

String Functions

You can write string functions to extract, find like, replace, and split data.

Learn more about Presto Regular Expression Functions.

Process JSON

Use JSON functions and operators process data.

Learn more about Presto JSON Functions and Operators

Numbers (integer, float)

There are many mathematical operators available in Presto that you can use to process data.

Here are a few you might find useful:

ceil</code>(_x_) → [same as input]
This is an alias for ceiling()</code>.
floor</code>(x) → [same as input]
Returns x rounded down to the nearest integer.
round</code>(x, d) → [same as input]
Returns x rounded to d decimal places

To review all mathematical operators, see Presto Documentation: Mathematical Functions and Operators.

Writing Presto Queries

Presto Options

Use the WITH Clause for Nested Queries

Using VALUES for Prototyping

Clean up a Table

ALTER TABLE - DROP COLUMN

Limitation in Presto on Multiple Updates

Improving Performance of Very Large Presto Queries

Supported Time Formats in TD TIME FORMAT UDF

SQL DDL Command Differences in Hive and Presto

Presto Query Signatures

Query Signature Definitions for Common Letters

Example of Query Signatures

Presto Queries

Info

Select Data

Aggregate Data

GROUP BY

MIN()/MAX()

MIN_BY/MAX_BY

APPROX_DISTINCT

APPROX_PERCENTILE

Work with Time Ranges

Choose a Target Data Time Range

Format Date Time Strings

Truncate Time into Hour, Day, and Month

Specify a Query Target Range

Create and Update Tables from SQL Results

CREATE TABLE … AS

INSERT INTO

Delete Records

Limitations

Syntax

Perform Window Functions

Access Array Type Data

ARRAY_AGG

AWS Lambda

Process Data

Data Type Conversion Functions

String Functions

Process JSON

Numbers (integer, float)