About the Hive Engine
Apache Hive is a data warehouse system built on top of Apache Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in various databases and file systems that integrate with Hadoop. Hive offers a simple way to apply structure to large amounts of unstructured data and then perform batch SQL-like queries on that data.
The Treasure Data Hive service (TD Hive), based on Apache Hive, provides batch processing of data stored in Treasure Data’s data lake. TD eliminates the need to run your own Hadoop clusters to handle Hive processing. Instead, Treasure Data operates compute clusters for running Hive jobs.
The Customer Data Platform (CDP) application uses Hive jobs for some of its internal operations, and you can also run your own Hive jobs. You can submit SELECT or DML queries written in Hive’s query language from TD Console, API calls, TD Toolbelt, or TD workflows. The service queues and executes the queries and returns the results. You can also design your system so that results are delivered to destinations specified in your Result Output.
The Hive query language (HiveQL) is the primary data processing method for Treasure Data. HiveQL is powered by Apache Hive.
Treasure Data is a CDP that allows users to collect, store, and analyze their data in the cloud. Treasure Data manages its own Hadoop cluster, which accepts queries from users and executes them using the Hadoop MapReduce framework. HiveQL is one of the languages it supports.
Hive and HiveQL Differences
- TD Hive supports the flexible schema capabilities of the TD platform; schema can be inferred from data loaded into tables or can be set explicitly.
- Treasure Data supports HiveQL semantics, but unlike Apache Hive, Treasure Data allows the user to set and modify the schema at any time. We do not require that a table schema be defined upfront.
Choosing an Analytics Engine
The two most popular engines are Hive and Trino (Presto):
- Hive — Designed for querying large datasets on the open-source Hadoop platform. Hive 2020.1 converts SQL-like queries into MapReduce jobs for easy execution and processing of extremely large volumes of data. Hive is optimized for query throughput and is often described as a pull model.
- Trino (Presto) — Both Trino 423 and Presto 350 are designed for fast, interactive queries on data in HDFS and other data sources. Trino (Presto) is optimized for latency and is often described as a push model.
 | Hive | Trino (Presto) |
---|---|---|
Optimized for | Throughput | Interactivity |
SQL standard fidelity | HiveQL (a subset of common data warehousing SQL) | Designed to comply with ANSI SQL |
Window functions | Yes | Yes |
Large JOINs | Very good for large fact-to-fact joins | Optimized for star schema joins (one large fact table and many smaller dimension tables) |
Maximum query length | No hard limit | 1 MB |
About Hive Query Engine
Apache Hive is a batch-oriented data warehouse system built on top of Apache Hadoop. TD Hive runs your HiveQL queries on compute clusters operated by Treasure Data; see About the Hive Engine, above, for details.
About Trino (Presto) Distributed Query Engine
Trino (Presto) is an open-source, distributed SQL query engine. It can query data where it is stored, without needing to move data into a separate analytics system. Treasure Data has customized Trino to talk directly with our distributed columnar storage layer.
Hive and Other Hadoop-Based Resource Pools
Hive Resource Pools allow you to divide the resources available for Hive into pools used for specific workloads. You can then organize pool usage across projects, groups, or use cases.
Resource pools are helpful for the following challenges:
- Ensuring that high priority jobs receive the resources they need to run within a strict SLA. For example, a batch job that must finish within a specific overnight time window.
- Reserving some resources for ad hoc jobs that must run with the lowest possible latency.
- Limiting the maximum resources and cost of running lower priority jobs.
You can set guaranteed minimum and maximum resources for jobs running in specific resource pools.
This feature is enabled upon request. Contact TD Support or your primary account representative if you want to use Hive Resource Pools.
Understanding Hadoop Resources and Resource Pools
Treasure Data’s Hadoop cluster is shared among many customers. Each customer starts with a single queue for all submitted Hadoop jobs, such as Hive queries, which are processed as resources permit. A number of parallel processes, based on your plan, are dedicated to processing those jobs. The number of cores is determined by your Hadoop Compute Units, as follows:
- A customer is guaranteed a minimum of 2 processing cores per compute unit at all times. For example, a customer with 20 Hadoop compute units gets a minimum of 40 cores at all times.
- During off-peak periods on Treasure Data’s Hadoop cluster (across all customers), a customer may be granted up to 4x their guaranteed compute cores, to process their jobs faster.
When you create multiple resource pools, you designate the minimum percentage of resources to devote to them. For example, you might create three pools:
- ad hoc assigned 25% of plan capacity
- batch assigned 65% of capacity
- best_effort assigned 10% of capacity
Based on the assigned percentages, a minimum number of CPU cores is guaranteed for jobs in each pool. Continuing the 40-core example:
- ad hoc jobs will always get at least 10 (=25% of 40) cores
- batch jobs will always get at least 26 (=65% of 40) cores
- best_effort jobs will always get at least 4 (=10% of 40) cores
The values are rounded down to the nearest whole number of cores. Resource pools are guaranteed a minimum of one core, unless configured to 0% allocation. During off-peak periods, jobs might be granted up to the full off-peak capacity of the whole plan. For example, any job could be granted up to 160 (=40*4) cores if the overall Treasure Data Hadoop workload permits.
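As a minimal sketch of this arithmetic (illustration only; the plan size, pool names, and percentages below are the example values from this page, and the calculation is performed by Treasure Data’s scheduler rather than by any client API), the guaranteed cores per pool can be computed as follows:
// Sketch: resource pool core arithmetic for the example plan above.
int computeUnits = 20;                      // example plan size
int guaranteedCores = computeUnits * 2;     // 2 cores per Hadoop compute unit = 40
int offPeakCores = guaranteedCores * 4;     // up to 4x during off-peak periods = 160

String[] poolNames  = {"ad hoc", "batch", "best_effort"};
double[] poolShares = {0.25, 0.65, 0.10};
for (int i = 0; i < poolNames.length; i++) {
    // Minimums are rounded down; any pool above 0% gets at least one core.
    int minCores = Math.max((int) Math.floor(guaranteedCores * poolShares[i]),
                            poolShares[i] > 0 ? 1 : 0);
    System.out.printf("%s: at least %d cores guaranteed, up to %d off-peak%n",
            poolNames[i], minCores, offPeakCores);
}
// Prints 10 cores for ad hoc, 26 for batch, and 4 for best_effort.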
If a job from one resource pool is running with more than its guaranteed resources and a job is submitted to another pool for which there are not currently enough resources, some portion of the processing for the first job may be preempted or deferred. This preemption does not mean that the job has to restart, only that it might take longer to run to completion. Jobs are always granted their minimum guaranteed resources.
Default Resource Pool Configuration
By default, a Hadoop plan has a single resource pool, named hadoop2 or hdp2, configured with 100% of resources.
To determine your current configured resource pools, contact Treasure Data support.
Submitting a Job with a Resource Pool Name that is not Configured
The job will run in the default resource pool with its configured available resources. The default is configured for 100% resource usage.
Selecting the Resource Pool Your Job Runs On
If resource pools are configured, and no resource pool is specified for a specific job, the job runs in the default resource pool. The default is configured with 100% resource and named hadoop2 or hdp2.
If you want to choose a specific resource pool for jobs that do not have a specified resource pool, contact Treasure Data support.
If you already have a resource pool name specified by a Hive query hint, providing a conflicting --pool-name argument on the command line causes the job to fail.
TD Console Option
For Hive queries, you can use a query hint (magic comment) to specify a resource pool. For example:
-- @TD pool_name: batch
select count(1) from mytable;
Command-Line Options
To specify a resource pool for a Hive query, table export, or bulk import job at the command line, add the --pool-name argument to td. For example, a Hive query can be run as follows:
td query --type hive --database <database_name> --pool-name <resource_pool_name> "select count(1) from mytable"
A table export can be run as follows:
td table:export example_db table1 --s3-bucket mybucket -k KEY_ID -s SECRET_KEY --pool-name batch
JDBC Driver for Hive Query Engine
Learn how to use Treasure Data’s JDBC (Java Database Connectivity) driver, which enables you to use Treasure Data with a standard JDBC interface.
- Download
- Usage
- JDBC Parameter Precedence
- List of JDBC Configuration Parameters
- JDBC URL examples
- Query Execution Internals
Treasure Data JDBC Driver (td-jdbc) is an open-source project and is available from the Treasure Data td-jdbc repository.
Download
The current supported version is 0.5.11.
- The JDBC driver download is available at mvnrepository.com/artifact/com.treasuredata/td-jdbc
- Use td-jdbc-0.5.11-jar-with-dependencies.jar
For Maven projects, use the following dependency setting:
<dependency>
  <groupId>com.treasuredata</groupId>
  <artifactId>td-jdbc</artifactId>
  <version>(version)</version>
</dependency>
Usage
Here is sample code for counting the number of rows in the www_access table in the sample_datasets database. First, create a java.sql.Connection object using the JDBC address jdbc:td://api.treasuredata.com/(database name), then create java.sql.Statement objects and call executeQuery(sql), etc., to run queries:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("user", "(your account e-mail)");
props.setProperty("password", "(your password)");
// Alternatively, you can use API key instead of user and password
// props.setProperty("apikey", "(your API key)")
// For using SSL connection (default is false)
// props.setProperty("useSSL", "true");
// To run Hive jobs specify "type" parameter. The default is "presto"
// props.setProperty("type", "hive");
Connection conn = DriverManager.getConnection("jdbc:td://api.treasuredata.com/sample_datasets", props);
Statement st = conn.createStatement();
try {
ResultSet rs = st.executeQuery("SELECT count(1) FROM www_access");
while (rs.next()) {
int count = rs.getInt(1);
System.out.println("result = " + count);
}
rs.close();
}
finally {
st.close();
conn.close();
}
JDBC Parameter Precedence
To configure JDBC connection parameters, you can use URL parameters, a Properties object, or system properties. The precedence, from highest to lowest, is:
- Environment variable (only for the TD_API_KEY parameter)
- System properties
- Properties object passed to DriverManager.getConnection(jdbc_url, Properties)
- JDBC URL parameters (e.g., jdbc:td://api.treasuredata.com/mydb;type=hive;useSSL=true), separated by semicolons

If your environment defines the TD_API_KEY variable, td-jdbc uses it. For the other properties, system properties have the highest priority.
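For example, because the Properties object takes precedence over URL parameters, the following sketch (the apikey placeholder and the database name mydb are examples to replace with your own values) submits Hive jobs even though the URL requests Presto:
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("apikey", "(your API key)");
props.setProperty("type", "hive");  // set in the Properties object

// The URL also sets type=presto, but the Properties object has higher
// precedence, so this connection submits Hive jobs.
Connection conn = DriverManager.getConnection(
        "jdbc:td://api.treasuredata.com/mydb;type=presto", props);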
List of JDBC Configuration Parameters
You must provide either the apikey property or both user (your account e-mail address) and password for authentication:
Key | Default value | Description |
---|---|---|
apikey | | API key to access Treasure Data. You can also set this via the TD_API_KEY environment variable. |
user | | Account e-mail address (unnecessary if apikey is set) |
password | | Account password (unnecessary if apikey is set) |
type | presto | Query engine: hive, presto, or pig |
useSSL | false | Use SSL encryption for accessing Treasure Data |
httpproxyhost | | Proxy host (optional), e.g., “myproxy.com” |
httpproxyport | | Proxy port (optional), e.g., “80” |
httpproxyuser | | Proxy user (optional) |
httpproxypassword | | Proxy password (optional) |
If both user and password are given, td-jdbc uses this pair instead of apikey.
JDBC URL examples
When a SQL/BI tool has no way to set these properties, use URL parameters. For example, to set the useSSL parameter in the URL:
jdbc:td://api.treasuredata.com/<db_name>;useSSL=true
To access a database named “sample_db” in your account:
jdbc:td://api.treasuredata.com/sample_db;useSSL=true
You can choose the query engine, Hive or Presto (the default), using the type parameter:
jdbc:td://api.treasuredata.com/sample_db;useSSL=true;type=hive
jdbc:td://api.treasuredata.com/sample_db;useSSL=true;type=presto
To connect to Treasure Data through a proxy server, specify the following proxy settings:
- httpproxyhost
- httpproxyport
- httpproxyuser
- httpproxypassword
For example:
jdbc:td://api.treasuredata.com/testdb;httpproxyhost=myproxy.com;httpproxyport=myport;httpproxyuser=myusername;httpproxypassword=mypassword
Query Execution Internals
When running a query (e.g., SELECT), the driver submits a job request to Treasure Data. td-jdbc periodically monitors the job’s progress and fetches the result after the job completes.
For INSERT statements, td-jdbc buffers the data in local memory and flushes it to Treasure Data every 5 minutes, so there is a delay before your data becomes accessible in Treasure Data.
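As a rough sketch of what this looks like from the client side (the table name mytable and its columns name and score are hypothetical placeholders, conn is the Connection from the usage example above, and you should check the td-jdbc documentation for the exact INSERT forms the driver accepts):
// Sketch only: insert a row through td-jdbc using a standard JDBC PreparedStatement.
// The execute() call returns once the row is buffered locally in the driver; the data
// is uploaded on the driver's periodic flush, so it is not immediately queryable.
PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO mytable (name, score) VALUES (?, ?)");
ps.setString(1, "alice");
ps.setInt(2, 42);
ps.execute();
ps.close();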