Data Tanks Using Presto

Use Presto, via TD Console, to query data stored in your Treasure Data Data Tank.

This optional feature enables users of Data Tanks to run queries against their Data Tank instance directly from the Presto query engine.

What is Data Tank?

Data Tanks provide easy access to your aggregated metrics through convenient, fully hosted data marts on Treasure Data’s core platform. They can be used to drive a variety of external business intelligence and visualization applications without having to host and maintain your own PostgreSQL instances.

Data Tanks are hosted PostgreSQL databases that serve as data marts. They are completely managed by Treasure Data including creation, setup, monitoring, management and troubleshooting so you can just get your job done.

Benefits

Use of Data Tanks provides the following key benefits:

  • Enables easier access to Data Tank data via querying from TD Console, via Presto. Including the creation and deletion of tables.
  • More flexible processing flows by enabling data to be joined across Data Tank and core TD tables.

Learn More

Limitations on Data Tank Queries from Presto

When creating or running Data Tank queries from Presto the following limitations apply:

  • Lists of tables or table preview functions are not supported in the query editor. You can still explore tables and preview tables by issuing queries against Data Tank system tables and normal tables.
  • You cannot access tables that use mixed case. Presto is a lowercase character system. Accessing case-sensitive systems like Postgres through Presto is a known issue.
  • Presto doesn't support numeric types in data tanks. To use decimal data in data tanks with Presto, specify the "real" or "double precision" type.
  • Queries from Presto against Data Tank tables that use aggregates (COUNT, MAX, SUM, etc.) may not generate optimal SQL for PostgreSQL. Presto does not push the aggregation down to the Data Tank PostgreSQL environment. Instead, all the rows needed to compute the aggregate are returned to Presto, and the aggregation is handled on a single node. If the source table is large enough, the required data movement and computation takes too long and TD may kill the query with a message like:
    Copy
    Copied
    [INTERNAL_FAILURE] Killed by the system because this query stalled for more than 1.00h. This is usually caused by a bug of Presto. The query execution will be retried in several minutes.Please ask support@treasure-data.com if you need a further help for this query

    To workaround the limitation, create a view in PostgreSQL that processes the aggregate on the PostgreSQL side, and query that view from the Presto side.

For example, a query against a 100M row Data Tank table giant_table is likely to fail:

Copy
Copied
select COUNT(*) as giant_count fromdatatank.public.giant_table;

A more efficient approach is to create a view in PostgreSQL that performs the resource-intensive processing in Data Tank:

Copy
Copied
create view count_view as select count(*) as cnt fromgiant_table;

Then execute a query on the view from Presto:

Copy
Copied
select cnt as giant_count from datatank.public.count_view;

Data Tank User Permissions for Presto Queries

When this querying functionality is turned on, Account Owner and Admin users are given Full Access permission.

All other users can be granted:

Permission Type Description
Full Access includes DML and DDL query usage
Query Only includes only DML query usage

Granting Access for Data Tank Queries from Presto

The data tank database is only available if your account has had this feature turned on by request. Administrators can grant these access levels.

  1. Make sure the Data Tank queries from Presto feature has been enabled for your account.
  2. Log in to the TD Console as an Administrator.
  3. Navigate to Settings > Team. For example, the Team section of TD Console.
  4. Select the data tank database.
  5. Specify the level of access you want to give.

Data Tank Query Statement Reference for Presto

Use Presto, via TD Console, to query data stored in your TD Data Tank.

This optional feature enables users of Data Tanks to run queries against their Data Tank instance directly from the Presto query engine. Contact your Customer Success representative or Support to learn more about adding this feature.

Data Tank Querying Reference (DML Queries)

SELECT query against a table

Copy
Copied
select avg(close) from datatank.public.normal_nasdaq where symbol = 'AAPL';
       _col0
--------------------
 14.823083442853807
(1 row)

SELECT query that joins Data Tank and TD Core tables

Copy
Copied
-- sample_datasets.nasdaq is a TD table
-- sample_datasets. can be left out if you specify it at '-d' option in command-line or select it at the database dropdown list
select count(distinct time) 
  from datatank.public.normal_nasdaq dt_nasdaq JOIN sample_datasets.nasdaq td_nasdaq 
  ON dt_nasdaq.symbol = td_nasdaq.symbol AND dt_nasdaq.open = td_nasdaq.open AND dt_nasdaq.close = td_nasdaq.close 
where dt_nasdaq.symbol = 'AAPL';

INSERT INTO a table in the data tank

Copy
Copied
insert into datatank.public.ms_nasdaq (symbol, open, volume, high, low, close, time) values ('AAPL', 10, 100, 20, 5, 7, td_scheduled_time());

INSERT INTO table from TD Table

Copy
Copied
insert into datatank.public.ms_nasdaq select * from sample_datasets.nasdaq where symbol = 'AAPL' ;

DML Queries that are not supported in Data Tanks

  • UPDATE against table in Data Tank (Not supported in Presto). Use Result Output to Data Tank for this purpose.
  • DELETE against table in Data Tank (Not supported in Presto PostgreSQL connector)

Data Tank Database Management Queries (DDL Queries)

List schemas

Copy
Copied
show schemas from datatank;
   Schema
--------------------
 information_schema
 pg_catalog
 public

List tables in schema

Copy
Copied
show tables from datatank.public ;
     Table
---------------
 normal_nasdaq

List columns of a table

Copy
Copied
desc datatank.public.normal_nasdaq;
 Column |  Type   | Comment
--------+---------+---------
 symbol | varchar |
 open   | double  |
 volume | bigint  |
 high   | double  |
 low    | double  |
 close  | double  |

CREATE TABLE AS in Data Tank from TD Table data

Copy
Copied
create table datatank.public.ms_nasdaq as select * from sample_datasets.nasdaq where symbol = 'MSFT' ;

DROP table

Copy
Copied
drop table datatank.public.ms_nasdaq;

DDL Queries that are not Supported in Data Tanks

  • Plain CREATE TABLE. (However CREATE TABLE AS SELECT is supported).
  • ALTER TABLE statement
  • DROP FOREIGN TABLE statement

Data Tank Presto FAQ

How Does Presto Query Processing Flow

A typical processing flow for a Presto query that joins data from a TD table with data from a Data Tank table is:

  1. Writes a query in Presto (within the TD Console)
  2. The Presto query begins running
  3. The Presto query generates and submits a sub-query to the Data Tank PostgreSQL instance
  4. The Data Tank instance runs that query, and returns the results to the Presto cluster
  5. The Presto cluster finishes running the query, completing the join of the data from the core TD dataset and the data returned from the PostgreSQL query result
  6. The query result is displayed in the TD Console. It can be submitted to a result output job, or downloaded from the browser, just like the result of any other Presto query

Can I issue queries that join data across TD core and Data Tank?

You can create queries that join TD core stored data and Data Tank stored data.

What happens when someone runs TD_TIME_RANGE(), or other TD specific UDFs, against data in Data Tank?

Presto and Treasure Data specific functions are run only after data is returned from the Data Tank instance. As such, TD_TIME_RANGE() or other functions are not “pushed down” the Data Tank query issued from Presto.

In general, we recommend not using such functions because it’s more performant to run queries using PostgreSQL friendly syntax. By using PostgreSQL friendly syntax, much of the processing can be completed on the Data Tank instance.

If you do run a TD or Presto specific function, the query still runs, but it may be less performant than if the WHERE function is effectively “pushed down” into the PostgreSQL query issued from Presto.

How are Data Tanks converted if the data type doesn’t exist in Treasure Data core? Like timestamp, float, varchar.

Presto as a query engine supports more data types than TD’s core tables. As such, when querying against PostgreSQL, our presto query engine is able to handle types such as a timestamp. Full type support by Presto can be seen here: https://trino.io/docs/350/language/types.html

If you are joining data, you may need to run a type conversion function to ensure the TD data type aligns with the data type you are joining from the Data Tank instance. But, you may need to run a type conversion when joining against any other 2 tables as well.