Data Tanks Using Trino(Presto)

Use Trino(Presto), via TD Console, to query data stored in your Treasure Data Data Tank.

This optional feature enables users of Data Tanks to run queries against their Data Tank instance directly from the Trino(Presto) query engine.

What is Data Tank?

Data Tanks provide easy access to your aggregated metrics through convenient, fully hosted data marts on Treasure Data’s core platform. They can be used to drive a variety of external business intelligence and visualization applications without having to host and maintain your own PostgreSQL instances.

Data Tanks are hosted PostgreSQL databases that serve as data marts. They are completely managed by Treasure Data including creation, setup, monitoring, management and troubleshooting so you can just get your job done.

Benefits

Use of Data Tanks provides the following key benefits:

Enables easier access to Data Tank data via querying from TD Console, via Trino(Presto). Including the creation and deletion of tables.
More flexible processing flows by enabling data to be joined across Data Tank and core TD tables.

Learn More

Limitations on Data Tank Queries from Trino(Presto)
Data Tank User Permissions for Trino(Presto) Queries
Granting Access for Data Tank Queries from Trino(Presto)
Data Tank Query Statement Reference for Trino(Presto)
Data Tank Trino(Presto) FAQ

Limitations on Data Tank Queries from Trino(Presto)

When creating or running Data Tank queries from Trino(Presto) the following limitations apply:

Lists of tables or table preview functions are not supported in the query editor. You can still explore tables and preview tables by issuing queries against Data Tank system tables and normal tables.
You cannot access tables that use mixed case. Trino(Presto) is a lowercase character system. Accessing case-sensitive systems like Postgres through Trino(Presto) is a known issue.
Trino(Presto) doesn't support numeric types in data tanks. To use decimal data in data tanks with Trino(Presto), specify the "real" or "double precision" type.
Queries from Trino(Presto) against Data Tank tables that use aggregates (COUNT, MAX, SUM, etc.) may not generate optimal SQL for PostgreSQL. Trino(Presto) does not push the aggregation down to the Data Tank PostgreSQL environment. Instead, all the rows needed to compute the aggregate are returned to Trino(Presto), and the aggregation is handled on a single node. If the source table is large enough, the required data movement and computation takes too long and TD may kill the query with a message like:
Copy

Copied
```
[INTERNAL_FAILURE] Killed by the system because this query stalled for more than 1.00h. This is usually caused by a bug of Presto. The query execution will be retried in several minutes.Please ask support@treasure-data.com if you need a further help for this query
```
To workaround the limitation, create a view in PostgreSQL that processes the aggregate on the PostgreSQL side, and query that view from the Trino(Presto) side.

For example, a query against a 100M row Data Tank table giant_table is likely to fail:

Copy

Copied

select COUNT(*) as giant_count fromdatatank.public.giant_table;

A more efficient approach is to create a view in PostgreSQL that performs the resource-intensive processing in Data Tank:

Copy

Copied

create view count_view as select count(*) as cnt fromgiant_table;

Then execute a query on the view from Trino(Presto):

Copy

Copied

select cnt as giant_count from datatank.public.count_view;

Data Tank User Permissions for Trino(Presto) Queries

When this querying functionality is turned on, Account Owner and Admin users are given Full Access permission.

All other users can be granted:

Permission Type	Description
Full Access	includes DML and DDL query usage
Query Only	includes only DML query usage

Granting Access for Data Tank Queries from Trino(Presto)

The data tank database is only available if your account has had this feature turned on by request. Administrators can grant these access levels.

Make sure the Data Tank queries from Trino(Presto) feature has been enabled for your account.
Log in to the TD Console as an Administrator.
Navigate to Settings > Team. For example, the Team section of TD Console.
Select the data tank database.
Specify the level of access you want to give.

Data Tank Query Statement Reference for Trino(Presto)

Use Trino(Presto), via TD Console, to query data stored in your TD Data Tank.

This optional feature enables users of Data Tanks to run queries against their Data Tank instance directly from the Trino(Presto) query engine. Contact your Customer Success representative or Support to learn more about adding this feature.

Data Tank Querying Reference (DML Queries)
Data Tank Database Management Queries (DDL Queries)

Data Tank Querying Reference (DML Queries)

SELECT query against a table

Copy

Copied

select avg(close) from datatank.public.normal_nasdaq where symbol = 'AAPL';
       _col0
--------------------
 14.823083442853807
(1 row)

SELECT query that joins Data Tank and TD Core tables

Copy

Copied

-- sample_datasets.nasdaq is a TD table
-- sample_datasets. can be left out if you specify it at '-d' option in command-line or select it at the database dropdown list
select count(distinct time) 
  from datatank.public.normal_nasdaq dt_nasdaq JOIN sample_datasets.nasdaq td_nasdaq 
  ON dt_nasdaq.symbol = td_nasdaq.symbol AND dt_nasdaq.open = td_nasdaq.open AND dt_nasdaq.close = td_nasdaq.close 
where dt_nasdaq.symbol = 'AAPL';

INSERT INTO a table in the data tank

Copy

Copied

insert into datatank.public.ms_nasdaq (symbol, open, volume, high, low, close, time) values ('AAPL', 10, 100, 20, 5, 7, td_scheduled_time());

INSERT INTO table from TD Table

Copy

Copied

insert into datatank.public.ms_nasdaq select * from sample_datasets.nasdaq where symbol = 'AAPL' ;

DML Queries that are not supported in Data Tanks

UPDATE against table in Data Tank (Not supported in Trino(Presto)). Use Result Output to Data Tank for this purpose.
DELETE against table in Data Tank (Not supported in Trino(Presto) PostgreSQL connector)

Data Tank Database Management Queries (DDL Queries)

List schemas

Copy

Copied

show schemas from datatank;
   Schema
--------------------
 information_schema
 pg_catalog
 public

List tables in schema

Copy

Copied

show tables from datatank.public ;
     Table
---------------
 normal_nasdaq

List columns of a table

Copy

Copied

desc datatank.public.normal_nasdaq;
 Column |  Type   | Comment
--------+---------+---------
 symbol | varchar |
 open   | double  |
 volume | bigint  |
 high   | double  |
 low    | double  |
 close  | double  |

CREATE TABLE AS in Data Tank from TD Table data

Copy

Copied

create table datatank.public.ms_nasdaq as select * from sample_datasets.nasdaq where symbol = 'MSFT' ;

DROP table

Copy

Copied

drop table datatank.public.ms_nasdaq;

DDL Queries that are not Supported in Data Tanks

Plain CREATE TABLE. (However CREATE TABLE AS SELECT is supported).
ALTER TABLE statement
DROP FOREIGN TABLE statement

Data Tank Trino(Presto) FAQ

How Does Trino(Presto) Query Processing Flow?
Can I issue queries that join data across TD core and Data Tank?
What happens when someone runs TD_TIME_RANGE(), or other TD specific UDFs, against data in Data Tank?
How are Data Tanks converted if the data type doesn’t exist in Treasure Data core? Like timestamp, float, varchar.

How Does Trino(Presto) Query Processing Flow

A typical processing flow for a Trino(Presto) query that joins data from a TD table with data from a Data Tank table is:

Writes a query in Trino(Presto) (within the TD Console)
The Trino(Presto) query begins running
The Trino(Presto) query generates and submits a sub-query to the Data Tank PostgreSQL instance
The Data Tank instance runs that query, and returns the results to the Trino(Presto) cluster
The Trino(Presto) cluster finishes running the query, completing the join of the data from the core TD dataset and the data returned from the PostgreSQL query result
The query result is displayed in the TD Console. It can be submitted to a result output job, or downloaded from the browser, just like the result of any other Trino(Presto) query

Can I issue queries that join data across TD core and Data Tank?

You can create queries that join TD core stored data and Data Tank stored data.

What happens when someone runs TD_TIME_RANGE(), or other TD specific UDFs, against data in Data Tank?

Trino(Presto) and Treasure Data specific functions are run only after data is returned from the Data Tank instance. As such, TD_TIME_RANGE() or other functions are not “pushed down” the Data Tank query issued from Trino(Presto).

In general, we recommend not using such functions because it’s more performant to run queries using PostgreSQL friendly syntax. By using PostgreSQL friendly syntax, much of the processing can be completed on the Data Tank instance.

If you do run a TD or Trino(Presto) specific function, the query still runs, but it may be less performant than if the WHERE function is effectively “pushed down” into the PostgreSQL query issued from Trino(Presto).

How are Data Tanks converted if the data type doesn’t exist in Treasure Data core? Like timestamp, float, varchar.

Trino(Presto) as a query engine supports more data types than TD’s core tables. As such, when querying against PostgreSQL, our presto query engine is able to handle types such as a timestamp. Full type support by Trino(Presto) can be seen here: https://trino.io/docs/350/language/types.html

If you are joining data, you may need to run a type conversion function to ensure the TD data type aligns with the data type you are joining from the Data Tank instance. But, you may need to run a type conversion when joining against any other 2 tables as well.