Accessing Hive 2
The current version of Hive on Treasure Data is 2023.1 (also known as Hive 4).
Because Treasure Data is a cloud-based hosting environment, the platform has certain constraints that do not exist in a local Hadoop setup. These constraints ensure that your application can take advantage of Treasure Data’s scalability and redundancy features.
TD Hive Known Limitations
The following limitations exist in TD Hive.
Custom Hive and MapReduce Code Support
Because the TD schema is flexible, there are limitations on DDL when using TD Hive.
Treasure Data limits the use of custom code in our Hadoop and Hive environment because we do not provide dedicated Hive clusters for individual customers. Consequently, there are the following limitations:
- Custom Hive UDFs cannot be added, because these UDFs would be exposed for use by all customers.
- Custom MapReduce scripts cannot be added due to security and system stability considerations.
Alternatives for running custom code within TD include:
- Running custom code in Docker containers (see Custom Scripts).
Single Query Limitation
A single query cannot commit more than 500,000 (500K) partitions. To stay within this limit, restrict the time range of the query, for example by splitting the job into several runs, as shown below.
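The following sketch splits a backfill into month-sized runs using the TD_TIME_RANGE UDF; the monthly_summary table and the aggregation are illustrative, not part of the platform's API:

-- One run of several; each covers one month so that a single query
-- commits far fewer than 500K partitions.
INSERT INTO TABLE monthly_summary
SELECT
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'UTC') AS day,
  COUNT(1) AS cnt
FROM www_access
WHERE TD_TIME_RANGE(time, '2023-01-01', '2023-02-01', 'UTC')
GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'UTC')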
Single Query Export Limitation of 100GB
If you encounter an error similar to this:
"Error: The number of chunks for multipart upload is exceeded. Please reduce the size of the uploading file.
It means the size of the total number of chunks created by a single query exceeded 100GB.
As workarounds:
- Use the result_output_redirect option.
- Use CREATE TABLE AS (see the sketch after this list).
- Split the query by time partitioning. (See Leveraging Time-based Partitioning.)
- Use INSERT INTO or INSERT OVERWRITE. (See Hive-INSERT-Syntax-in-Treasure-Data.)
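For example, CREATE TABLE AS materializes the result as a new table instead of exporting it directly; the access_summary name and the aggregation below are illustrative:

-- Materialize the query result into a table rather than exporting it
CREATE TABLE access_summary AS
SELECT
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'UTC') AS day,
  COUNT(1) AS cnt
FROM www_access
GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'UTC')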
SELECT Row Limit
The maximum number of rows for SELECT queries is 2^31 - 1 (2,147,483,647) on both CDH4 and HDP2. If the number of rows exceeds this limit, the query fails.
As a workaround, use INSERT INTO to write the rows into a table instead of returning them from the SELECT, as sketched below.
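A minimal sketch, assuming a destination table already exists (large_result is an illustrative name):

-- Write the rows into a table instead of returning them from SELECT
INSERT INTO TABLE large_result
SELECT host, path
FROM www_access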
Union All Limitation
Using UNION ALL in combination with percentile and a lateral view can result in an error.
Bad Code Example
This code produces the following error:
FAILED: SemanticException. Exception when trying to remove partition predicates: fail to find child from parent
SELECT
col2
FROM
www_access
lateral view explode(ARRAY(1, 2, 3, 4)) t1 as col2
UNION ALL
SELECT
SUM(1) as col1
UNION ALL
SELECT
percentile(100, 0.9) as col1
Good Code Example
SELECT
col2
FROM
www_access
lateral view explode(ARRAY(1, 2, 3, 4)) t1 as col2
UNION ALL
SELECT
percentile(100, 0.9) as col1
UNION ALL
SELECT
SUM(1) as col1
Division
When two values (of any type) are divided in Hive, the result of the division is always automatically upgraded to 'double'.
To obtain an integer result from a division, cast the result of the
division to the needed type. For example:
SELECT
CAST((column1 / column2) AS bigint) as division
FROM
your_table
Info
The Presto engine behaves differently with divisions and does not automatically upgrade the result to double.
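For example:

SELECT 5 / 2
-- Hive: 2.5 (result is upgraded to double)
-- Presto: 2 (integer division truncates)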
Trim Double-Byte Space
The TRIM function in Hive can't handle a double-byte space. For example:
SELECT
TRIM('　a') as value_keyword
FROM
your_table
=> '　a' (the expected result is 'a'; the leading space is a double-byte space)
As a workaround, use regexp_replace:
SELECT
regexp_replace(TRIM('　a　'), '^　+|　+$', '') as value_keyword
FROM
your_table
Hive Floating Point
Hive adds extra decimals to floating-point values (see "Float and double calculation is inaccurate in Hive"). The following example illustrates the behavior:
- Create a table floattest with two columns:
  - floatval for the float datatype
  - doubleval for the double datatype
- Insert values into the floattest table:
  - 123123.2 as floatval
  - 123123.2 as doubleval
- Select cast(floatval as double) from floattest.

Select floatval from floattest returned 123123.203125, which is consistent with the output from Java because the data is stored as a double. Select doubleval from floattest returned 123123.2.
If you are concerned about the precision of numbers in your output, use double instead of float.
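The walkthrough above as a HiveQL sketch, assuming an environment where this DDL is permitted (TD performs most DDL outside Hive):

CREATE TABLE floattest (floatval FLOAT, doubleval DOUBLE);

INSERT INTO TABLE floattest VALUES (123123.2, 123123.2);

SELECT floatval FROM floattest;   -- returns 123123.203125 (extra decimals)
SELECT doubleval FROM floattest;  -- returns 123123.2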
Hadoop HDFS Data Partitioning
In a standard Hadoop/Hive cluster, users include more explicit information about partitioning, bucketing data, and directory hierarchies for partition placement. Hive features like static and dynamic partitioning depend on this data being provided in each query. TD manages many of the storage details transparently.
Treasure Data does support time-based and user-defined partitioning. However, other standard Hive partitioning mechanisms are not relevant and are not supported.
DDL Operations Performed External to Hive
Creating data schema objects using Hive DDL is limited because TD manages the creation of tables and the definition of schema external to the Hive query language. For example, TD infers schema from ingested data. As a convenience, some Hive DDL features are supported but other components in the platform may also manipulate schema.
For more information see Hive DDL.
Hive Data Types and TD Native Data Types
While the full range of Hive data types can be used in any expression, when inserting data values into TD tables, you may have to CAST the Hive data type to a type supported for storing values in TD.
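For example, a Hive TIMESTAMP or DECIMAL value can be cast before insertion; the table and column names below are illustrative:

INSERT INTO TABLE events
SELECT
  CAST(event_ts AS bigint) AS time,   -- timestamp cast to unix seconds (bigint)
  CAST(amount AS double) AS amount    -- decimal cast to double
FROM staging_events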
Hive Metastore Not Needed
Treasure Data’s platform manages table schema and partition metadata in its own metastore instead of using a standard Hive Metastore. This supports our flexible table schema capabilities. This is only a limitation if you use external tools to work with data lakes that rely on a Hive Metastore.
Security and Authentication
The security and authentication model for Hive in our platform follows the security and authentication model of the platform itself.
TD Hive on TEZ
Apache Tez™ is an application framework built on Hadoop YARN that can execute complex directed acyclic graphs of general data processing tasks. In many ways, it can be thought of as a more flexible and powerful successor to the MapReduce framework. The main advantage of Hive on Tez is better performance.
Hive's Reserved Keywords
Also refer to Language Manual DDL.
Version | Non-reserved Keywords | Reserved Keywords
---|---|---
Hive 1.2.0 | ADD, ADMIN, AFTER, ANALYZE, ARCHIVE, ASC, BEFORE, BUCKET, BUCKETS, CASCADE, CHANGE, CLUSTER, CLUSTERED, CLUSTERSTATUS, COLLECTION, COLUMNS, COMMENT, COMPACT, COMPACTIONS, COMPUTE, CONCATENATE, CONTINUE, DATA, DATABASES, DATETIME, DAY, DBPROPERTIES, DEFERRED, DEFINED, DELIMITED, DEPENDENCY, DESC, DIRECTORIES, DIRECTORY, DISABLE, DISTRIBUTE, ELEM_TYPE, ENABLE, ESCAPED, EXCLUSIVE, EXPLAIN, EXPORT, FIELDS, FILE, FILEFORMAT, FIRST, FORMAT, FORMATTED, FUNCTIONS, HOLD_DDLTIME, HOUR, IDXPROPERTIES, IGNORE, INDEX, INDEXES, INPATH, INPUTDRIVER, INPUTFORMAT, ITEMS, JAR, KEYS, KEY_TYPE, LIMIT, LINES, LOAD, LOCATION, LOCK, LOCKS, LOGICAL, LONG, MAPJOIN, MATERIALIZED, METADATA, MINUS, MINUTE, MONTH, MSCK, NOSCAN, NO_DROP, OFFLINE, OPTION, OUTPUTDRIVER, OUTPUTFORMAT, OVERWRITE, OWNER, PARTITIONED, PARTITIONS, PLUS, PRETTY, PRINCIPALS, PROTECTION, PURGE, READ, READONLY, REBUILD, RECORDREADER, RECORDWRITER, REGEXP, RELOAD, RENAME, REPAIR, REPLACE, REPLICATION, RESTRICT, REWRITE, RLIKE, ROLE, ROLES, SCHEMA, SCHEMAS, SECOND, SEMI, SERDE, SERDEPROPERTIES, SERVER, SETS, SHARED, SHOW, SHOW_DATABASE, SKEWED, SORT, SORTED, SSL, STATISTICS, STORED, STREAMTABLE, STRING, STRUCT, TABLES, TBLPROPERTIES, TEMPORARY, TERMINATED, TINYINT, TOUCH, TRANSACTIONS, UNARCHIVE, UNDO, UNIONTYPE, UNLOCK, UNSET, UNSIGNED, URI, USE, UTC, UTCTIMESTAMP, VALUE_TYPE, VIEW, WHILE, YEAR | ALL, ALTER, AND, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH, BY, CASE, CAST, CHAR, COLUMN, CONF, CREATE, CROSS, CUBE, CURRENT, CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATABASE, DATE, DECIMAL, DELETE, DESCRIBE, DISTINCT, DOUBLE, DROP, ELSE, END, EXCHANGE, EXISTS, EXTENDED, EXTERNAL, FALSE, FETCH, FLOAT, FOLLOWING, FOR, FROM, FULL, FUNCTION, GRANT, GROUP, GROUPING, HAVING, IF, IMPORT, IN, INNER, INSERT, INT, INTERSECT, INTERVAL, INTO, IS, JOIN, LATERAL, LEFT, LESS, LIKE, LOCAL, MACRO, MAP, MORE, NONE, NOT, NULL, OF, ON, OR, ORDER, OUT, OUTER, OVER, PARTIALSCAN, PARTITION, PERCENT, PRECEDING, PRESERVE, PROCEDURE, RANGE, READS, REDUCE, REVOKE, RIGHT, ROLLUP, ROW, ROWS, SELECT, SET, SMALLINT, TABLE, TABLESAMPLE, THEN, TIMESTAMP, TO, TRANSFORM, TRIGGER, TRUE, TRUNCATE, UNBOUNDED, UNION, UNIQUEJOIN, UPDATE, USER, USING, UTC_TMESTAMP, VALUES, VARCHAR, WHEN, WHERE, WINDOW, WITH
Hive 2.0.0 | removed: REGEXP, RLIKE; added: AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE | added: COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START
Hive 2.1.0 | added: ABORT, KEY, LAST, NORELY, NOVALIDATE, NULLS, RELY, VALIDATE | added: CACHE, CONSTRAINT, FOREIGN, PRIMARY, REFERENCES
Hive 2.2.0 | added: DETAIL, DOW, EXPRESSION, OPERATOR, QUARTER, SUMMARY, VECTORIZATION, WEEK, YEARS, MONTHS, WEEKS, DAYS, HOURS, MINUTES, SECONDS | added: DAYOFWEEK, EXTRACT, FLOOR, INTEGER, PRECISION, VIEWS