Writing Presto Queries
The following information can help you learn how to write Presto queries.
Presto Options
Consider the following Presto features when creating queries to run in Treasure Data.
Use the WITH Clause for Nested Queries
The WITH clause is useful for nested queries as shown in this example query:
SELECT
a,
b,
c
FROM (
SELECT
a,
MAX(b) AS b,
MIN(c) AS c
FROM
tbl
GROUP BY
a
) tbl_alias
The same query can be written with the WITH clause as follows:
WITH tbl_alias AS(
SELECT
a,
MAX(b) AS b,
MIN(c) AS c
FROM
tbl
GROUP BY
a
) SELECT
a,
b,
c
FROM
tbl_alias
Notice that the query has been “de-nested”.
The following example shows multiple subqueries use WITH. Each subquery is delimited by a comma.
WITH tbl1 AS(
SELECT
a,
MAX(b) AS b,
MIN(c) AS c
FROM
tbl
GROUP BY
a
),
tbl2 AS(
SELECT
a,
AVG(d) AS d
FROM
another_tbl
GROUP BY
a
) SELECT
tbl1. * ,
tbl2. *
FROM
tbl1
JOIN
tbl2
ON tbl1.a = tbl2.a
Using VALUES for Prototyping
If you want to quickly test Presto syntax, you can use VALUES to create a table immediately.
SELECT
a,
b,
c
FROM (
VALUES(
1,
2.0,
'x'
),
(
3,
4.0,
'y'
),
(
5,
6.0,
'z'
)
) tbl(
a,
b,
c
)
The above query returns the following table:
a | b | c |
---|---|---|
1 | 2.0 | 'x' |
3 | 4.0 | 'y' |
5 | 6.0 | 'z' |
Clean up a Table
To clean up a table before using CREATE TABLE AS or INSERT INTO statements, use multiple statements split by semi-colon.
DROP
TABLE
IF EXISTS mytable
;
CREATE
TABLE
mytable AS SELECT
. . .
;
ALTER TABLE - DROP COLUMN
You can include the SQL DDL statement ALTER TABLE...DROP COLUMN SQL in your Treasure Data queries to, for example, deduplicate data.
An example is as follows:
DROP
TABLE
IF EXISTS task_temp
;
CREATE
TABLE
task_temp AS SELECT
*
FROM
(
SELECT
* ,
row_number(
) over(
partition BY id
ORDER BY
TD_TIME_PARSE(lastmodifieddate) DESC
) AS rnk
FROM
task_raw
)
WHERE
rnk = 1
AND isdeleted = 0
ORDER BY
lastmodifieddate DESC
;
ALTER TABLE
task_temp DROP
COLUMN rnk
;
Limitation in Presto on Multiple Updates
Multiple-statement execution is not guarded by a transaction, therefore never write multiple update operations in a single job. For example, if you write two or more INSERT INTO statements in a single job, it may produce duplicated records:
# I1 INSERT
INTO
mytable SELECT
. . .
;
# I2 INSERT
INTO
mytable SELECT
. . .
;
If the system finds a sporadic failure during the job execution, it will start the job from the scratch, so the actual execution could be I1 (success), I2 (failure), (retry from the scratch), I1 (success) and I2 (success). In this case, I1 was executed twice and I2 was executed only once. The result is duplicated records in the target table.
Improving Performance of Very Large Presto Queries
You can parallelize the query result output process by using the CREATE TABLE AS SELECT statement. If you DROP the table before running the query, your performance is significantly better. The result output performance will be 5x faster than running SELECT *. Treasure Data Presto skips the JSON output process and directly produces a 1-hour partitioned table.
Without using DROP TABLE, Presto uses JSON text to materialize query results. And if the result table contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result. So, even if the query computation is almost finished, output of theJSON results takes a long time.
To clean up the result table beforehand:
- Add a DROP TABLE statement at the top of your query.
- Use CREATE TABLE (table) AS SELECT …
For example, your query might look like this:
DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS
SELECT * FROM my_table;
Supported Time Formats in TD TIME FORMAT UDF
This is a summary of the time formats supported by the
TD_TIME_FORMAT_UDF
udf.
Formats
Syntax | Date or Time Component | Presentation | Examples |
---|---|---|---|
G | Era designator | Text | AD |
yyyy | Year | Year | 1996 |
yy | Year | Year (2 digits) | 96 |
MMMM | Month in year | Month long name | July |
MMM | Month in year | Month short name | Jul |
MM, M | Month in year | Month number | 07 |
ww, w | Week in year | Number | 06 |
DDD, DD, D | Day in year | Number | 189 |
dd, d | Day in month | Number | 10 |
EEEE | Day in week | Text | Tuesday |
E, EEE | Day in week | Text (short form) | Tue |
a | Am/pm marker | Text | PM |
HH, H | Hour in day (0-23) | Number | 00 |
kk, k | Hour in day (1-24) | Number | 24 |
KK, K | Hour in AM/PM (0-11) | Number | 00 |
hh, h | Hour in AM/PM (1-12) | Number | 12 |
mm, m | Minute in hour | Number | 30 |
ss, s | Second in minute | Number | 55 |
SSS, SS, S | Millisecond | Number | 978 |
wyear | Week of the year | Number | TD_TIME_FORMAT(1582859260, 'wyear' TD_TIME_FORMAT(1582859260, 'wyear', 'UTC')) |
zzzz | Time zone | Zone long name | Pacific Standard Time, or GMT+01:00 |
z | Time zone | Zone short name | PST, or GMT+01:00 |
Z | Time zone | Zone offset | -0800 |
u | Day number of week (1-7) | Number | 1 (for Monday) |
Example
This example formats a UNIX timestamp into a date formatted string. You can use the sample_datasets database for example to try it out:
SQL
SELECT
time,
TD_TIME_FORMAT(time,
'yyyy-MM-dd HH:mm:ss z')
FROM
www_access LIMIT 1
;
Result
2012-01-01 00: 00: 00 UTC
SQL DDL Command Differences in Hive and Presto
Data Definition Language (DDL) is a concept in SQL that can be used to create and modify the structure of database objects or schema objects. You can use DDL statements to create, manage, and delete tables, indexes, users, and other schema objects.
As it pertains to data-processing, Presto is optimized for quickly processing smaller volumes of information while Hive is better for querying more expansive databases requiring large amounts of memory. Both Hive and Presto use DDL commands but in slightly different ways.
The following table illustrates DDL commands required for Hive and Presto.
SQL DDL Commands and Their Equivalents
SQL DDL Commands | Hive SQL DDL | Presto SQL DDL | Query Definition |
---|---|---|---|
ALTER | ALTER | ALTER | Modifies the structure of an existing table. |
CREATE | CREATE | CREATE | Creates a database or objects including tables, views, stored procedures, etc. |
DROP | DROP | DROP | Deletes an existing database or an object within a database. |
TRUNCATE | TRUNCATE | DELETE | Removes all the records from a table. |
Presto Query Signatures
You can view query signatures from the Presto Utilization dashboard and use the query signatures to identify and group similar queries.
The Presto Utilization dashboard is available upon request for most customers.
Presto query signatures are clusters of Presto queries share common SQL statements. Similar jobs, jobs with only a few differences (such as the date range computed), are grouped together. You can then analyze the clusters of queries and determine which queries require the most or least resources in a given month.
The Presto Query Details table, which is posted on the Presto Utilization dashboard, includes a table column: Query Names or Signature.
Query Signature Definitions for Common Letters
Query Signatures are subject to change. The following list defines common query letters that help to decipher the query that the query signature represents:
Query Letter | Description |
---|---|
S | Select \|S\[\* \] means SELECT \* is used (retrieving all columns. This argument might be inefficient because our storage is columnar format.) |
Cnt | count(x) |
CntD | count(distinct x) (If query has this signature, it means a single node process = slow!) |
T | table |
G | Group by |
J | Join |
O | Order by (single node process = slow!) |
L | Limit |
CT | Create table |
I | Insert Into |
DT | Delete table |
D | Delete |
E | Duplicate Elimination (select distinct, union, etc.) |
U | Union All |
E(U(…)) | Union (with duplicate elimination) |
Example of Query Signatures
The following example shows SQL statements mapped to query signatures:
Query Signature | Results |
---|---|
SELECT 1 | S(V) |
SELECT * FROM t | S * t - > # |
INSERT INTO A SELECT * FROM B | B I( S * ) B - > A |
SELECT * FROM t, ‘query’ | S[*](T) |
SELECT * FROM t, ‘table’ | t - > # |
Presto Queries
What follows is not a complete set of SQL statements. These are just some basic SQL functions that you might use in your queries.
Info
Before you begin writing your queries, review Presto Query Runtime Limits and Query Hint Override.
Select Data
You can use the SELECT statement to select data from a database. The data returned is stored in a results table, called the result-set.
SELECT … FROM … WHERE … LIMIT
Example
SELECT contact_id, last_name, first_name
FROM contacts
WHERE website = "Company_Organization.com"
ORDER BY contact_id DESC
LIMIT 5;
Aggregate Data
Use aggregate functions to perform a calculation on one or more values and return a single value.
GROUP BY
Use the GROUP BY
command to group the result set (used with aggregate functions: COUNT, MAX, MIN, SUM, AVG).
Example
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country;
MIN()/MAX()
The MIN() function returns the smallest value of the selected column.
Example
SELECT MIN(columnname)
FROM table_name
WHERE condition;
The MAX() function returns the largest value of the selected column.
Example
SELECT MAX(column_name)
FROM table_name
WHERE condition;
MIN_BY/MAX_BY
MIN_BY
takes two arguments and returns the value of the first argument for which the value of the second argument is minimized.
SELECT MIN_BY(close_date, close_value) as date_of_min_sale
FROM sales_pipeline
MAX_BY
takes two arguments and returns the value of the first argument for which the value of the second argument is maximized.
APPROX_DISTINCT
APPROX_DISTINCT
calculates an approximate count of the number of distinct values.
Example
SELECT APPROX_DISTINCT(close_value)
FROM sales_pipeline
APPROX_PERCENTILE
APPROX_PRECENTILE
calculates the value at a given percentile of a distribution of values.
Example
SELECT APPROX_PERCENTILE(close_value, 0.75)
FROM sales_pipeline
Work with Time Ranges
Choose a Target Data Time Range
Use TD_TIME_RANGE and TD_INTERVAL.
Format Date Time Strings
Use Use TD_TIME_FORMAT and TD_TIME_STRING.
Truncate Time into Hour, Day, and Month
Use TD_DATE_TRUNC
Specify a Query Target Range
Create and Update Tables from SQL Results
The following SQL statements can help you create and update tables in Presto.
CREATE TABLE … AS
Create a new table containing the result of a select query. Use CREATE TABLE to create an empty table. You might want to use these optional clauses:
-
IF NOT EXISTS
clause causes the error to be suppressed if the table already exists. -
WITH
clause can be used to set properties on the newly created table.
Example
CREATE TABLE [ IF NOT EXISTS ] table_name [ ( column_alias, ... ) ]
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]
INSERT INTO
Insert data into an existing table. The following example uses person
as the name of the table.
Example
insert into person values (1, 'james', 10, '20190301');
Delete Records
Delete rows from a table. If you specify a WHERE
clause, only the matching rows are deleted. Otherwise, all rows from the table are deleted.
The WHERE clause also helps reduce unintentional DELETE commands.
DELETE statements are helpful for:
- Cleaning up error records -Removing duplicate records
- Removing unnecessary or unexpected results from result tables
Limitations
Review Delete Statement Syntax and known limitations for the Presto Delete Statement.
Syntax
DELETE FROM <table_name> [ WHERE <condition> ]
Where:
- <table _ name> is the name of the table for deletion
-
is the clause to specify a search condition
Examples Delete all products shipped by boat.
DELETE FROM products WHERE shipmode = 'BOAT';
Delete all products for low priority orders.
DELETE FROM products
WHERE orderkey IN (SELECT orderkey FROM orders WHERE priority = 'LOW');
Delete all products.
DELETE FROM products;
Perform Window Functions
Window functions perform calculations across rows of the query result. They run after the HAVING
clause but before the ORDER BY
clause. Invoking a window function requires special syntax using the OVER
clause to specify the window. A window has these three components:
-
Partition Specification
. Separates the input rows into different partitions. This is analogous to how the
GROUP BY
clause separates rows into different groups for aggregate functions. - Ordering Specification . Determines the order in which input rows are processed by the window function.
-
Window Frame
. Specifies a sliding window of rows to be processed by the function for a given row. If the frame is not specified, it defaults to
RANGE UNBOUNDED PRECEDING</code>, which is the same as
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW . This frame contains all rows from the start of the partition up to the last peer of the current row.
For example, the following query ranks orders for each clerk by price.
Example
SELECT orderkey, clerk, totalprice,
rank() OVER (PARTITION BY clerk
ORDER BY totalprice DESC) AS rnk
FROM orders
ORDER BY clerk, rnk
Access Array Type Data
ARRAY_AGG
You can use the aggregate function, array_agg, with ARRAY data types that allow arrays to be created as part of the SELECT list of a query.
Example
The following target table contains one row per employee and stores all phone numbers associated with an employee in an ARRAY type.
CREATE SET TABLE employeePhoneInfo
(emp_id INTEGER,
emp_name VARCHAR(30),
emp_phone emp_phone_array);
INSERT INTO employeePhoneInfo
SELECT emp_id, emp_name,
ARRAY_AGG(emp_phone, NEW emp_phone_array())
FROM employee GROUP BY emp_id,emp_name
WHERE emp_id < 100;
AWS Lambda
AWS Lambda is a compute service that lets you run code without provisioning or managing servers.
Example
This example creates an array (fruits). The has_elements
flag identifies whether or not more elements need to be processed.
exports.handler = (event, context, callback) =>
var fruits = ['apple', 'orange', 'pinaple'];
event.fruits = fruits;
if (event.fruits.length > 0)
event.has_elements = true;
else
event.has_elements = false;
callback(null, event);
;
This example processes the array elements. We want to remove processed elements from the array. The .shift()
method removes the first element from the array.
exports.handler = (event, context, callback) =>
let fruit = event.fruits.shift();
// Process array element
console.log('Processing fruit: ' + fruit);
// Array still has elements:
if (event.fruits.length > 0)
event.has_elements = true;
else
event.has_elements = false;
// Log array elements (for demonstration purpose)
console.log('Elements in array: ' + event.fruits);
console.log('Array has more elements: ' + event.has_elements);
callback(null, event);
;
Process Data
Data Type Conversion Functions
Presto implicity converts numeric and character values to the correct type if such a conversion is possible. Presto does not convert between character and numeric types. For example, a query that expects a varchar
does not automatically convert a bigint
value to an equivalent varchar
.
When necessary, values can be explicitly cast to a particular type.
Conversion Functions
cast</code>(<em>value AS type</em>) → type
Explicitly cast a value as a type. This can be used to cast a varchar to a numeric value type and vice versa.
try_cast</code>(<em>value AS type</em>) → type
Like cast()
, but returns null if the cast fails.
String Functions
You can write string functions to extract, find like, replace, and split data.
Learn more about Presto Regular Expression Functions.
Process JSON
Use JSON functions and operators process data.
Learn more about Presto JSON Functions and Operators
Numbers (integer, float)
There are many mathematical operators available in Presto that you can use to process data.
Here are a few you might find useful:
-
ceil</code>(_x_) → [same as input]
This is an alias for
ceiling()</code>
. -
floor</code>(<em>x</em>) → [same as input]
Returns
x
rounded down to the nearest integer. -
round</code>(<em>x</em>, <em>d</em>) → [same as input]
Returns
x
rounded tod
decimal places
To review all mathematical operators, see Presto Documentation: Mathematical Functions and Operators.