Accessing Hive 2

The current version of Hive on Treasure Data is 2023.1 (also known as Hive 4).

Because Treasure Data is a cloud-based hosting environment, the platform has certain constraints that do not exist in a local Hadoop setup. These constraints ensure that your application can take advantage of Treasure Data’s scalability and redundancy features.

TD Hive Known Limitations

The following limitations exist in TD-Hive.

Custom Hive and MapReduce Code Support

Because the TD schema is flexible, there are limitations on DDL when using TD Hive.

Treasure Data limits the use of custom code in our Hadoop and Hive environment because we do not provide dedicated Hive clusters for individual customers. Consequently, the following limitations apply:

  • Custom Hive UDFs cannot be added, because these UDFs would be exposed for use by all customers.
  • Custom MapReduce scripts cannot be added due to security and system stability considerations.

Alternatives for running custom code within TD include the UDFs that Treasure Data ships with Hive (for example, the Hivemall machine learning functions) and Custom Scripts in Treasure Workflow.

Single Query Limitation

A single query cannot commit more than 500,000 (500K) partitions. To stay within this limit, narrow the time range of the query, as in the example below.
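For example, restricting the scan with TD's TD_TIME_RANGE UDF bounds the number of time partitions a query touches (the table name and dates here are illustrative):

SELECT
  COUNT(1)
FROM
  www_access
WHERE
  TD_TIME_RANGE(time, '2024-01-01', '2024-02-01', 'UTC')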

Single Query Export Limitation of 100GB

If you encounter an error similar to this:

"Error: The number of chunks for multipart upload is exceeded. Please reduce the size of the uploading file.

It means the size of the total number of chunks created by a single query exceeded 100GB.

As a workaround, reduce the size of each exported result, for example by splitting the query into several queries over narrower time ranges, as sketched below.
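A minimal sketch of such a split, assuming a www_access table and monthly ranges (both illustrative); each statement runs as a separate query:

-- Query 1: first month only
SELECT *
FROM www_access
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-02-01', 'UTC')

-- Query 2: second month only
SELECT *
FROM www_access
WHERE TD_TIME_RANGE(time, '2024-02-01', '2024-03-01', 'UTC')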

SELECT Row Limit

The maximum number of rows for SELECT queries is 2^31 - 1 (2,147,483,647) on both CDH4 and HDP2. If the number of rows exceeds the limit, the query fails.

As a workaround, use INSERT INTO to write the result rows into a table instead of returning them from SELECT, as sketched below.
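A minimal sketch, assuming a pre-created destination table my_results (both table names are illustrative):

INSERT INTO TABLE my_results
SELECT
  col1,
  col2
FROM
  www_access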

Union All Limitation

Using UNION ALL in combination with percentile() and a LATERAL VIEW can result in an error.

Bad Code Example

This code produces the following error:

FAILED: SemanticException. Exception when trying to remove partition predicates: fail to find child from parent

SELECT
  col2
FROM
  www_access
  LATERAL VIEW explode(ARRAY(1, 2, 3, 4)) t1 AS col2
UNION ALL
SELECT
  SUM(1) AS col1
UNION ALL
SELECT
  percentile(100, 0.9) AS col1

Good Code Example

SELECT
  col2
FROM
  www_access
  LATERAL VIEW explode(ARRAY(1, 2, 3, 4)) t1 AS col2
UNION ALL
SELECT
  percentile(100, 0.9) AS col1
UNION ALL
SELECT
  SUM(1) AS col1

Division

When two values (of any type) are divided in Hive, the result of the division is always automatically upgraded to 'double'.
To obtain an integer result from a division, cast the result of the division to the needed type. For example:

SELECT
  CAST((column1 / column2) AS bigint) AS division
FROM
  table
Info

The Presto engine behaves differently with division and does not automatically upgrade the result to double.
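A minimal illustration of the difference, using integer literals:

-- Hive: operands are promoted, so the result is 2.5 (double)
SELECT 5 / 2

-- Presto: integer division truncates, so the result is 2 (integer)
SELECT 5 / 2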

Trim Double-Byte Space

The TRIM function in Hive can't handle a double-byte (full-width) space. For example:

SELECT
  TRIM('　a') AS value_keyword   -- '　' is a double-byte (U+3000) space
FROM
  table
=> '　a' (expected result is 'a')

As a workaround, use regexp_replace:

SELECT
  regexp_replace(TRIM('　a　'), '^　+|　+$', '') AS value_keyword
FROM
  table

Hive Floating Point

Hive adds extra decimals to floating-point values (see "Float and double calculation is inaccurate in Hive"). The following example illustrates the behavior:

  1. Create a table floattest with 2 columns:
    • floatval for the float datatype
    • doubleval for the double datatype
  2. Insert values into the floattest table:
    • 123123.2 as floatval
    • 123123.2 as doubleval
  3. Select cast(floatval as double) from floattest:
    • SELECT floatval FROM floattest returned: 123123.203125. This is consistent with the output from Java, because the data is stored as a double.
    • SELECT doubleval FROM floattest returned: 123123.2

If you are concerned about the precision of numbers in your output, use double instead of float.
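A minimal HiveQL sketch of the steps above (on TD, tables are normally created outside of Hive DDL, so treat this as an illustration for a plain Hive environment):

-- Step 1: create the table
CREATE TABLE floattest (
  floatval float,
  doubleval double
);

-- Step 2: insert the same literal into both columns
INSERT INTO TABLE floattest VALUES (123123.2, 123123.2);

-- Step 3: read the values back
SELECT CAST(floatval AS double) FROM floattest;  -- returns 123123.203125
SELECT doubleval FROM floattest;                 -- returns 123123.2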

Hadoop HDFS Data Partitioning

In a standard Hadoop/Hive cluster, users include more explicit information about partitioning, bucketing data, and directory hierarchies for partition placement. Hive features like static and dynamic partitioning depend on this data being provided in each query. TD manages many of the storage details transparently.

Treasure Data does support time-based and user-defined partitioning. However, other standard Hive partitioning mechanisms are not relevant and are not supported.

DDL Operations Performed External to Hive

Creating data schema objects using Hive DDL is limited because TD manages the creation of tables and the definition of schema external to the Hive query language. For example, TD infers schema from ingested data. As a convenience, some Hive DDL features are supported but other components in the platform may also manipulate schema.

For more information see Hive DDL.

Hive Data Types and TD Native Data Types

While the full range of Hive data types can be used in any expression, when inserting values into TD tables you may have to CAST the Hive data type to a type supported for storing values in TD, as sketched below.
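A minimal sketch, assuming a staging table with a Hive DECIMAL column and a TIMESTAMP column that are stored in a TD table as double and string (all names are illustrative):

INSERT INTO TABLE td_table
SELECT
  CAST(amount AS double) AS amount,      -- DECIMAL cast to double for TD storage
  CAST(event_ts AS string) AS event_ts   -- TIMESTAMP cast to string for TD storage
FROM
  staging_table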

Hive Metastore Not Needed

Treasure Data’s platform manages table schema and partition metadata in its own metastore instead of using a standard Hive Metastore. This supports our flexible table schema capabilities. This is only a limitation if you use external tools to work with data lakes that rely on a Hive Metastore.

Security and Authentication

The security and authentication model for Hive in our platform follows the security and authentication model of the platform itself.

TD Hive on TEZ

Apache Tez™ is an application framework built on Hadoop YARN that can execute complex directed acyclic graphs of general data-processing tasks. In many ways, it can be thought of as a more flexible and powerful successor to the MapReduce framework. The main advantage of Hive on Tez is better performance.

Hive's Reserved Keywords

Also refer to Language Manual DDL.

Hive 1.2.0

Non-reserved keywords: ADD, ADMIN, AFTER, ANALYZE, ARCHIVE, ASC, BEFORE, BUCKET, BUCKETS, CASCADE, CHANGE, CLUSTER, CLUSTERED, CLUSTERSTATUS, COLLECTION, COLUMNS, COMMENT, COMPACT, COMPACTIONS, COMPUTE, CONCATENATE, CONTINUE, DATA, DATABASES, DATETIME, DAY, DBPROPERTIES, DEFERRED, DEFINED, DELIMITED, DEPENDENCY, DESC, DIRECTORIES, DIRECTORY, DISABLE, DISTRIBUTE, ELEM_TYPE, ENABLE, ESCAPED, EXCLUSIVE, EXPLAIN, EXPORT, FIELDS, FILE, FILEFORMAT, FIRST, FORMAT, FORMATTED, FUNCTIONS, HOLD_DDLTIME, HOUR, IDXPROPERTIES, IGNORE, INDEX, INDEXES, INPATH, INPUTDRIVER, INPUTFORMAT, ITEMS, JAR, KEYS, KEY_TYPE, LIMIT, LINES, LOAD, LOCATION, LOCK, LOCKS, LOGICAL, LONG, MAPJOIN, MATERIALIZED, METADATA, MINUS, MINUTE, MONTH, MSCK, NOSCAN, NO_DROP, OFFLINE, OPTION, OUTPUTDRIVER, OUTPUTFORMAT, OVERWRITE, OWNER, PARTITIONED, PARTITIONS, PLUS, PRETTY, PRINCIPALS, PROTECTION, PURGE, READ, READONLY, REBUILD, RECORDREADER, RECORDWRITER, REGEXP, RELOAD, RENAME, REPAIR, REPLACE, REPLICATION, RESTRICT, REWRITE, RLIKE, ROLE, ROLES, SCHEMA, SCHEMAS, SECOND, SEMI, SERDE, SERDEPROPERTIES, SERVER, SETS, SHARED, SHOW, SHOW_DATABASE, SKEWED, SORT, SORTED, SSL, STATISTICS, STORED, STREAMTABLE, STRING, STRUCT, TABLES, TBLPROPERTIES, TEMPORARY, TERMINATED, TINYINT, TOUCH, TRANSACTIONS, UNARCHIVE, UNDO, UNIONTYPE, UNLOCK, UNSET, UNSIGNED, URI, USE, UTC, UTCTIMESTAMP, VALUE_TYPE, VIEW, WHILE, YEAR

Reserved keywords: ALL, ALTER, AND, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH, BY, CASE, CAST, CHAR, COLUMN, CONF, CREATE, CROSS, CUBE, CURRENT, CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATABASE, DATE, DECIMAL, DELETE, DESCRIBE, DISTINCT, DOUBLE, DROP, ELSE, END, EXCHANGE, EXISTS, EXTENDED, EXTERNAL, FALSE, FETCH, FLOAT, FOLLOWING, FOR, FROM, FULL, FUNCTION, GRANT, GROUP, GROUPING, HAVING, IF, IMPORT, IN, INNER, INSERT, INT, INTERSECT, INTERVAL, INTO, IS, JOIN, LATERAL, LEFT, LESS, LIKE, LOCAL, MACRO, MAP, MORE, NONE, NOT, NULL, OF, ON, OR, ORDER, OUT, OUTER, OVER, PARTIALSCAN, PARTITION, PERCENT, PRECEDING, PRESERVE, PROCEDURE, RANGE, READS, REDUCE, REVOKE, RIGHT, ROLLUP, ROW, ROWS, SELECT, SET, SMALLINT, TABLE, TABLESAMPLE, THEN, TIMESTAMP, TO, TRANSFORM, TRIGGER, TRUE, TRUNCATE, UNBOUNDED, UNION, UNIQUEJOIN, UPDATE, USER, USING, UTC_TMESTAMP, VALUES, VARCHAR, WHEN, WHERE, WINDOW, WITH

Hive 2.0.0

Non-reserved keywords: removed REGEXP, RLIKE; added AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE

Reserved keywords: added COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START

Hive 2.1.0

Non-reserved keywords: added ABORT, KEY, LAST, NORELY, NOVALIDATE, NULLS, RELY, VALIDATE

Reserved keywords: added CACHE, CONSTRAINT, FOREIGN, PRIMARY, REFERENCES

Hive 2.2.0

Non-reserved keywords: added DETAIL, DOW, EXPRESSION, OPERATOR, QUARTER, SUMMARY, VECTORIZATION, WEEK, YEARS, MONTHS, WEEKS, DAYS, HOURS, MINUTES, SECONDS

Reserved keywords: added DAYOFWEEK, EXTRACT, FLOOR, INTEGER, PRECISION, VIEWS
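If a column or table name in your data collides with one of these reserved keywords, quote it with backticks, Hive's standard identifier escape (the names below are illustrative):

SELECT
  `date`,   -- DATE is reserved
  `user`    -- USER is reserved
FROM
  access_log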