# pyTD Quickstart
`pytd` provides user-friendly interfaces to Treasure Data's REST APIs, the Presto query engine, and Plazma primary storage. The seamless connection allows your Python code to efficiently read and write large volumes of data to and from Treasure Data.

We recommend using IPython with this library for the best experience during development.
> **Info:** Treasure Data is no longer accepting new users for the Plazma Public API.
## Setup

### Installing

```sh
pip install pytd
```
### Initializing Client

Set your TD API key and endpoint in the environment variables `TD_API_KEY` and `TD_API_SERVER`, respectively, and create a client instance. You can also optionally set the query engine to use for all queries; the default query engine is Presto.

```python
import pytd

client = pytd.Client(database='sample_datasets')

# or, hard-code your API key, endpoint, and/or query engine:
client = pytd.Client(apikey='X/XXX',
                     endpoint='https://api.treasuredata.com/',
                     database='sample_datasets',
                     default_engine='presto')
```
> **Info:** The default engine is set to `presto`. To use Hive you can either:
>
> - set `default_engine='hive'` at client initialization to use Hive for all queries, or
> - pass the `engine='hive'` parameter at run time.
## Basic Use

### Running a Query

All queries in Treasure Data run on either Presto or Hive. You can specify the query engine at client initialization or per query at run time; the default engine is Presto.

```python
client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}

client.query('select hivemall_version()', engine='hive')
# {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb, 2019)
```
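As shown above, `client.query` returns a plain dictionary of column names and rows. If you would rather work with a DataFrame at this point, you can convert the result yourself; a minimal sketch, using a sample dict shaped like the query output above:

```python
import pandas as pd

# A sample result dict shaped like the output of client.query above
res = {'columns': ['symbol', 'cnt'],
       'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252]]}

# Build a DataFrame from the 'data' rows, labeled with the 'columns' names
df = pd.DataFrame(res['data'], columns=res['columns'])
print(df)
#   symbol   cnt
# 0   AAIT   590
# 1    AAL    82
# 2   AAME  9252
```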
### Loading a DataFrame from Treasure Data

When pulling data from Treasure Data, all data is represented as a `pandas.DataFrame`. To do this we will use functions from the `pytd.pandas_td` submodule.

#### Load Table

To read a Treasure Data table into a pandas DataFrame, use the `read_td_table` function. Note that `engine` is an object returned by the `create_engine` function.
```python
import pytd.pandas_td as td

# Assumes TD_API_KEY and TD_API_SERVER env variables are set
engine = td.create_engine("presto:my_db")
df = td.read_td_table('table_name', engine, limit=10000)
```

```python
import pytd.pandas_td as td

# Declare API key and endpoint manually
con = td.connect(apikey='XXX', endpoint="https://api.treasuredata.com")
engine = td.create_engine("presto:my_db", con=con)
df = td.read_td_table('table_name', engine, limit=10000)
```
#### Load from Query

To run a query and load the output into a DataFrame, use the `read_td_query` function.

```python
query = 'SELECT foo FROM bar'
df = td.read_td_query(query,
                      engine,
                      index_col=None,
                      parse_dates=None,
                      distributed_join=False,
                      params=None)
```
`read_td_query` takes the following parameters:

- `query` (str) — SQL string to be executed.
- `engine` — query engine, for example `presto_engine`.
- `index_col` — column name to use as the DataFrame index.
- `parse_dates` (list, None) — list of column names to parse as dates.
- `distributed_join` (bool) — (Presto only) if `True`, distributed join is enabled; if `False` (default), broadcast join is used.
- `params` — extra parameters passed with the query, such as `priority`.
The `priority` parameter can be set using syntax similar to the following:

```python
df_2 = td.read_td('SELECT foo FROM bar',
                  engine,
                  params={'priority': 1})
# Note: priority is ordered from highest 2, 1, 0, -1, -2 lowest
```
### Writing DataFrame to Treasure Data

To write data to Treasure Data, use the `load_table_from_dataframe` function to write a locally defined pandas DataFrame to a table.

```python
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
client.load_table_from_dataframe(df,
                                 'table_name',
                                 writer='bulk_import',
                                 if_exists='overwrite')
```
#### `writer` parameter

For the `writer` parameter, pytd supports different ways to ingest data to Treasure Data:

- `bulk_import` — convert the data into a CSV file and upload it in batch fashion.
- `insert_into` — insert every single row in the DataFrame by issuing an `INSERT INTO` query through the Presto query engine. Recommended only for a small volume of data.

The characteristics of each of these methods can be summarized as follows:

| | `bulk_import` | `insert_into` |
|---|:---:|:---:|
| Scalable against data volume | ✓ | |
| Write performance for larger data | | |
| Memory efficient | ✓ | ✓ |
| Disk efficient | | ✓ |
| Minimal package dependency | ✓ | ✓ |
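The trade-offs above can be reduced to a simple rule of thumb. The helper below is a hypothetical sketch, not part of pytd: it picks a writer based on row count, following the note that `insert_into` is recommended only for small data. The 10,000-row cutoff is an arbitrary assumption.

```python
def choose_writer(n_rows: int) -> str:
    """Pick a writer for load_table_from_dataframe by data size.

    Hypothetical helper, not part of pytd. The 10,000-row cutoff is an
    assumption; insert_into is recommended only for a small data volume.
    """
    return 'insert_into' if n_rows < 10_000 else 'bulk_import'

print(choose_writer(100))        # insert_into
print(choose_writer(1_000_000))  # bulk_import
```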
#### `if_exists` parameter

For the `if_exists` parameter, pytd supports four different behavior patterns:

- `error`: raise an exception.
- `overwrite`: drop the table, recreate it, and insert data.
- `append`: insert data (create the table if it does not exist).
- `ignore`: do nothing.
## Advanced Use

For more advanced examples, see the sample code on Google Colaboratory.

## Further Reading

- Choosing between PyTD, td-client-python, and Pandas-TD - different libraries for different needs.
- PyTD API Reference