Accessing Hive 4

Support for Hive 4 is now available on the Treasure Data console.

Specifying Hive 4 on the TD Console

After your account has been enabled for Hive 4 functionality, the drop-down menu for Type in the Queries window will include Hive 2023.1. Selecting this option will cause TD to process your query with Hive 4.

Specifying Hive 4 using the TD Toolbelt

Using TD Toolbelt, you can specify a Hive 4 query by including the command line option
-T hive --engine-version stable.

For example, a Hive 4 query would look something like this:

Copy

Copied

td query -w -d sample_datasets "SELECT COUNT(1) FROM www_access" -T hive --engine-version stable

Specifying Hive 4 with the Python Client

Using the Python Client you can specify a Hive 4 query by specifying the engine_version as stable. For example you can schedule a job using Hive 4 with the following Python code:

QueryCreate Schedule

Copy

Copied

import tdclient

with tdclient.Client() as td:
    job = td.query(
        "sample_datasets",
        "SELECT COUNT(1) FROM www_access",
        type="hive",
        engine_version="stable",
    )

    job.wait()
    for row in job.result():
        print(repr(row))

Copy

Copied

import tdclient

with tdclient.Client() as td:
    td.create_schedule(
        "sched-example",
        {
            "type": "hive",
            "engine_version": "stable",
            "query": "select count(1) from www_access",
            "database": "sample_datasets",
            "cron": "0 13 * * *",
        },
    )

New SQL Function support

TD has added support for the following SQL features in Hive 4:

Quote(str) function

Quotes a string to produce a result that can be used as a properly escaped data value in an SQL statement. The string is returned enclosed by single quotation marks and with each instance of backslash, single quote ('), ASCII NUL, or Control+Z preceded by a backslash.

If the argument is NULL, the return value is the word “NULL” without enclosing single quotation marks.

Copy

Copied

SELECT str, quote(str) FROM test_quote;

+-------------------+----------------------+
|        str        |         _c1          |
+-------------------+----------------------+
| Don't say "lazy"  | 'Don\'t say "lazy"'  |
+-------------------+----------------------+

JSON_READ(json,type) function

JSON_READ parses the given json according to the given complex type specification.

Copy

Copied

WITH parsed AS (
  SELECT json_read('[{"user_id": 1, "item_ids": [1, 5]}, {"user_id": 2, "item_ids": [2, 8]}]', 'array<struct<user_id:int,item_ids:array<int>>>') AS users
)
SELECT
  users[0].user_id AS user_id_0, users[0].item_ids AS item_ids_0,
  users[1].user_id AS user_id_0, users[1].item_ids AS item_ids_0,
  users
FROM parsed;

+------------+-------------+--------------+---------------+----------------------------------------------------+
| user_id_0  | item_ids_0  | user_id_0_1  | item_ids_0_1  |                       users                        |
+------------+-------------+--------------+---------------+----------------------------------------------------+
| 1          | [1,5]       | 2            | [2,8]         | [{"user_id":1,"item_ids":[1,5]},{"user_id":2,"item_ids":[2,8]}] |
+------------+-------------+--------------+---------------+----------------------------------------------------+

Sinh, cosh, tanh functions

The following hyperbolic built-in functions for MATH usage have been added: sinh, cosh, tanh.

Tumbling_window(timestamp)

This function rounds timestamp agr1 to the beginning of window interval where it belongs to.

Copy

Copied

SELECT tumbling_window(cast('2024-05-12 15:23:59' as timestamp), interval '2' MINUTES);

+------------------------+
|          _c0           |
+------------------------+
| 2024-05-12 15:22:00.0  |
+------------------------+

CAST + FORMAT

Introduced the FORMAT clause to CAST statements as well as to SQL:2016 datetime formats. Here is a limited list of datetime formats supported: YYYY, MM, DD, HH, MI, SS, YYY, YY, Y, RRRR, RR, DDD, HH12, HH24, SSSSS, FF[1..9], AM/A.M., PM/P.M., TZH, TZM.

Copy

Copied

SELECT
  cast(cast('2024-05-12 15:23:59' as timestamp) as string format 'yyyy-MM-dd'),
  cast('2024/05/12 15:23:59' as timestamp format 'yyyy/MM/dd HH24:MI:ss');

+-------------+------------------------+
|     _c0     |          _c1           |
+-------------+------------------------+
| 2024-05-12  | 2024-05-12 15:23:59.0  |
+-------------+------------------------+

Array type functions

array_min(array(obj1, obj2,...)): Returns the minimum value in an array with elements for which order is supported, returns null if array is empty
array_max(array(obj1, obj2,...)): Returns the maximum value in an array with elements for which order is supported, returns null if array is empty
array_distinct(array(obj1, obj2,...)): Returns an array of the same type as the input argument where all duplicate values have been removed.
array_join(array, delimiter, replaceNull): Concatenate the elements of an array with a specified delimiter Example:
array_slice(array, start, length): Returns the subset or range of elements from an array (subarray)
array_except(array1, array2): Returns an array of the elements in array1 but not in array2, without duplicates.
array_intersect(array1, array2): Returns an array of the elements in the intersection of array1 and array2, without duplicates.
array_union(array1, array2): Returns an array of the elements in the union of array1 and array2 without duplicates.
array_remove(array, element): Removes all occurrences of element from array.

percentile_cont, percentile_disc

Copy

Copied

SELECT
  percentile_cont(ws_sales_price, 0.5),
  percentile_disc(ws_sales_price, 0.5)
FROM web_sales;

approx_distinct

DataSketches functions

If approximate results are acceptable, there is a class of specialized algorithms called streaming algorithms, or sketches, that can produce results orders of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of real-time analysis, sketches are the only known solution.

Hive4 integrate with Apache Datasketches library more closely

HyperLogLog
- ds _ hll _ estimate
- ds _ hll _ estimate _ bounds
- ds _ hll _ sketch
- ds _ hll _ stringify
- ds _ hll _ union
- ds _ hll _ union _ f
Theta Sketch
- ds _ theta _ estimate
- ds _ theta _ exclude
- ds _ theta _ intersect
- ds _ theta _ intersect _ f
- ds _ theta _ sketch
- ds _ theta _ union
- ds _ theta _ union _ f
Tuple Sketch
- ds _ tuple _ arrayofdouble _ estimate
- ds _ tuple _ arrayofdouble _ estimate _ bounds
- ds _ tuple _ arrayofdouble _ means
etc...

typeof

This function returns a string describing the type of the argument.

Copy

Copied

SELECT typeof('text'), typeof(1), typeof(CAST(NULL AS bigint)), typeof(array(1.0));

+---------+------+---------+----------------------+
|   _c0   | _c1  |   _c2   |         _c3          |
+---------+------+---------+----------------------+
| string  | int  | bigint  | array<decimal(1,0)>  |
+---------+------+---------+----------------------+

get_sql_schema

This function helps to retrieve each column name and it's data type.

Copy

Copied

SELECT get_sql_schema('select ws_item_sk, max(ws_quantity) as max_quantity from web_sales group by ws_item_sk');

+---------------+-----------+
|   col_name    | col_type  |
+---------------+-----------+
| ws_item_sk    | bigint    |
| max_quantity  | int       |
+---------------+-----------+

Hive 4 Limitations and Differences

The following limitations or differences exist between the stable version of Hive 2 (Hive 2020.1) and Hive 4.

`ORDER BY` Sorts Nulls Last

While not necessarily a limitation, for ORDER_BY, the 2023.1 default ascending sort order (ASC) is NULLS LAST, and the default descending sort order (DESC) is NULLS FIRST.

Copy

Copied

ORDER BY name;

null            aaa
aaa      ->     bbb
bbb             null

And, to use traditional sorting you can use the following keyword:

Copy

Copied

ORDER BY name NULLS FIRST;

Copy

Copied

ORDER BY name NULLS LAST;

The `mask_hash` UDF Algorithm Has Changed.

The default mask has changed from MD5 to SHA256.

Time Index Log Messages Have Changed

In Hive 4, you will see a change in the number of log messages related to time index filtering in output log. These messages will look something like this:

Copy

Copied

** WARNING: time index filtering is not set on suprith_test.million_partitions!
** This query could be very slow as a result.
** If you used 'unix_timestamp' please modify your query to use TD_SCHEDULED_TIME instead
**   or rewrite the condition using TD_TIME_RANGE
** Please see https://docs.treasuredata.com/display/public/PD/Hive+Performance+Tuning#HivePerformanceTuning-LeveragingTime-basedPartitioning

New Property to Disable Sorting of Subqueries

Hive3 added the hive.remove.orderby.in.subquery property to disable sorting in subqueries (see HIVE-6348.This feature continues in Hive 4. Functions that rely on UDF sorting, such as TD_SESSIONIZE are not affected by this change.

"True" and "False" Strings Resolve Differently

The strings "true" and "false" can now be evaluated as TRUE/FALSE using the AbstractGenericUDAFResolver. However, strings such as "aaa" still return true, but functions like TD_AVGIF and TD_SUMIF will return different results given “false” as an argument.

Hive 2 TD_SUMIFHive 4 TD_SUMIF

Copy

Copied

SELECT
  TD_SUMIF(1, "true"),   --> 1
  TD_SUMIF(1, "false"),  --> 1
  TD_SUMIF(1, "aaa"),    --> 1

Copy

Copied

SELECT
  TD_SUMIF(1, "true"),   --> 1
  TD_SUMIF(1, "false"),  --> null
  TD_SUMIF(1, "aaa"),    --> 1

`TIMESTAMP WITH LOCAL TIME ZONE` is Supported

While this feature is support, be aware that it is currently impossible to change the default time zone used during the execution of the query. Therefore, the “local” time zone used will always be GMT+0 (UTC).

Cast NULL Literals in Common Table Expressions

In some cases, common table expressions (CTEs) can fail if you use a null literal in a SELECT statement. For example, here is CTE that fails:

Copy

Copied

WITH x AS (SELECT null AS mofu, 'POST' AS method)
SELECT t1.method
FROM sample_datasets.www_access t1
WHERE EXISTS (SELECT 1 FROM x t2 WHERE t1.method=t2.method)
UNION ALL
SELECT x.method FROM x

ql.Driver: FAILED: SemanticException [Error 10305]: CREATE-TABLE-AS-SELECT creates a VOID type, please use CAST to specify the type, near field:  mofu
org.apache.hadoop.hive.ql.parse.SemanticException: CREATE-TABLE-AS-SELECT creates a VOID type, please use CAST to specify the type, near field:  mofu

You should explicitly cast the null literal to the expected type. For example, in this CTE null is cast to type string:

Copy

Copied

WITH x AS (SELECT CAST(null AS STRING) AS mofu, 'POST' AS method)

Stateful UDFs Can Only Be Invoked in the SELECT List

Stateful UDFs such as TD_X_RANK cannot be used anywhere other than in the SELECT list. For example, specifying TD_X_RANK in the HAVING clause causes this query to fail:

Copy

Copied

SELECT col1, col2 FROM table
HAVING TD_X_RANK(col2) < 10

FAILED: SemanticException [Error 10084]: Stateful UDF's can only be invoked in the SELECT list
org.apache.hadoop.hive.ql.parse.SemanticException: Stateful UDF's can only be invoked in the SELECT list

To fix this you use a window function with a QUALIFY filter.

Window function in the select list with a QUALIFY clauseWindow function in the QUALIFY clause

Copy

Copied

SELECT
  col1, col2,
  ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2 DESC) AS rank
FROM table
QUALIFY
  rank < 10

Copy

Copied

SELECT
  col1, col2,
FROM table
QUALIFY
  ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2 DESC) < 10

Aliasing `HAVING` with CTE

In some cases, due to query compilation optimizations introduced in Hive 2023.1, HAVING clauses that reference aliases inside CTEs no longer work.

Copy

Copied

WITH x (
  SELECT col1, COUNT(*) AS num FROM table1
  GROUP BY 1
  HAVING NUM > 10
)
SELECT * FROM table2 WHERE col1 IN (SELECT col1 FROM x)

FAILED: CalciteSubquerySemanticException Encountered Select alias 'num' in having clause 'num > 10' This non standard behavior is not supported with cbo on. Turn off cbo for these queries.

Rewrite your CTE so the HAVING clause does not reference an alias.

Copy

Copied

WITH x (
  SELECT col1, COUNT(*) AS num FROM table1
  GROUP BY 1
  HAVING COUNT(*) > 10
)
SELECT * FROM table2 WHERE col1 IN (SELECT col1 FROM x)

Nested Parentheses in a Sub-query Don't Parse

Nested parentheses within subqueries are no longer allowed in Hive 2023.1.

Copy

Copied

SELECT col1, AVG(col2)
FROM (
  (
    SELECT col1, col2 FROM table WHERE time = 252691200
  )
) sub
GROUP BY col1

org.apache.hadoop.hive.ql.parse.ParseException: line 6:0 cannot recognize input near ')' 'sub' 'GROUP' in subquery source

Rewrite your subquery without nesting parentheses.

Copy

Copied

SELECT col1, AVG(col2)
FROM (
  SELECT col1, col2 FROM table WHERE time = 252691200
) sub
GROUP BY col1

YEAR UDF cannot be a numeric type

2023.1 throws a syntax error when you specify a numeric value for YEAR. The same applies to other UDFs(e.g. MONTH, DAY, HOUR, MINUTE, SECOND and WEEKOFYEAR) that handle dates. The value you specify for these UDFs must be a timestamp string.

Copy

Copied

SELECT YEAR(2024)

FAILED: SemanticException [Error 10016]: Line 1:12 Argument type mismatch '2024': year does not take INT type

To resolve this issue specify a valid timestamp string.

Copy

Copied

SELECT YEAR('2024-01-01')

TD UDFs with a database name

{any db name}.TD_••• succeeds with 2020.1, but fails with 2023.1.

Here is an example of how TDURLDECODE with a database name fails in 2023.1:

Copy

Copied

SELECT sample_datasets.TD_URL_DECODE(path) FROM sample_datasets.www_access

FAILED: SemanticException [Error 10011]: Invalid function sample_datasets.TD_URL_DECODE

You can avoid the problem by removing the database name from the UDF identifier.

Copy

Copied

SELECT TD_URL_DECODE(path) FROM sample_datasets.www_access

Reserved Keywords of 2023.1

Hive 2023.1 introduced the following reserved keywords. If you use any of these keywords as table names or column names, you will need to quote them with backticks.

ANY
APPLICATION
COMPACTIONID
CONNECTORa
CONNECTORS
CONVERT
DDL
DEC
FORCE
LEADING
NUMERIC
OLDER
PKFK _ JOIN
PREPARE
QUALIFY
REAL
SOME
SYNC
TIMESTAMPLOCALTZ
THAN
TRAILING
UNIQUE

The keywords for Hive 2023.1 are almost identical to the Keywords for Hive 4.0.0 except for TIME which is non-reserved in the Treasure Data environment.

Version	Non-reserved Keywords	Reserved Keywords
Hive 1.2.0	ADD, ADMIN, AFTER, ANALYZE, ARCHIVE, ASC, BEFORE, BUCKET, BUCKETS, CASCADE, CHANGE, CLUSTER, CLUSTERED, CLUSTERSTATUS, COLLECTION, COLUMNS, COMMENT, COMPACT, COMPACTIONS, COMPUTE, CONCATENATE, CONTINUE, DATA, DATABASES, DATETIME, DAY, DBPROPERTIES, DEFERRED, DEFINED, DELIMITED, DEPENDENCY, DESC, DIRECTORIES, DIRECTORY, DISABLE, DISTRIBUTE, ENABLE, ESCAPED, EXCLUSIVE, EXPLAIN, EXPORT, FIELDS, FILE, FILEFORMAT, FIRST, FORMAT, FORMATTED, FUNCTIONS, HOLD_DDLTIME, HOUR, IDXPROPERTIES, IGNORE, INDEX, INDEXES, INPATH, INPUTDRIVER, INPUTFORMAT, ITEMS, JAR, KEYS, LIMIT, LINES, LOAD, LOCATION, LOCK, LOCKS, LOGICAL, LONG, MAPJOIN, MATERIALIZED, METADATA, MINUS, MINUTE, MONTH, MSCK, NOSCAN, NO_DROP, OFFLINE, OPTION, OUTPUTDRIVER, OUTPUTFORMAT, OVERWRITE, OWNER, PARTITIONED, PARTITIONS, PLUS, PRETTY, PRINCIPALS, PROTECTION, PURGE, READ, READONLY, REBUILD, RECORDREADER, RECORDWRITER, REGEXP, RELOAD, RENAME, REPAIR, REPLACE, REPLICATION, RESTRICT, REWRITE, RLIKE, ROLE, ROLES, SCHEMA, SCHEMAS, SECOND, SEMI, SERDE, SERDEPROPERTIES, SERVER, SETS, SHARED, SHOW, SHOW_DATABASE, SKEWED, SORT, SORTED, SSL, STATISTICS, STORED, STREAMTABLE, STRING, STRUCT, TABLES, TBLPROPERTIES, TEMPORARY, TERMINATED, TINYINT, TOUCH, TRANSACTIONS, UNARCHIVE, UNDO, UNIONTYPE, UNLOCK, UNSET, UNSIGNED, URI, USE, UTC, VIEW, WHILE, YEAR	ALL, ALTER, AND, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH, BY, CASE, CAST, CHAR, COLUMN, CONF, CREATE, CROSS, CUBE, CURRENT, CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATABASE, DATE, DECIMAL, DELETE, DESCRIBE, DISTINCT, DOUBLE, DROP, ELSE, END, EXCHANGE, EXISTS, EXTENDED, EXTERNAL, FALSE, FETCH, FLOAT, FOLLOWING, FOR, FROM, FULL, FUNCTION, GRANT, GROUP, GROUPING, HAVING, IF, IMPORT, IN, INNER, INSERT, INT, INTERSECT, INTERVAL, INTO, IS, JOIN, LATERAL, LEFT, LESS, LIKE, LOCAL, MACRO, MAP, MORE, NONE, NOT, NULL, OF, ON, OR, ORDER, OUT, OUTER, OVER, PARTIALSCAN, PARTITION, PERCENT, PRECEDING, PRESERVE, PROCEDURE, RANGE, READS, REDUCE, REVOKE, RIGHT, ROLLUP, ROW, ROWS, SELECT, SET, SMALLINT, TABLE, TABLESAMPLE, THEN, TIMESTAMP, TO, TRANSFORM, TRIGGER, TRUE, TRUNCATE, UNBOUNDED, UNION, UNIQUEJOIN, UPDATE, USER, USING, UTC_TMESTAMP, VALUES, VARCHAR, WHEN, WHERE, WINDOW, WITH
Hive 2.0.0	removed: HOLD_DDLTIME, IGNORE, NO_DROP, OFFLINE, PROTECTION, READONLY, REGEXP, RLIKE added: AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE	added: COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START
Hive 2.1.0	added: ABORT, KEY, LAST, NORELY, NOVALIDATE, NULLS, RELY, VALIDATE	added: CACHE, CONSTRAINT, FOREIGN, PRIMARY, REFERENCES
Hive 2.2.0	removed: MINUS added: CACHE, DAYS, DAYOFWEEK, DUMP, HOURS, MATCHED, MERGE, MINUTES, MONTHS, QUARTER, REPL, SECONDS, STATUS, VIEWS, WEEK, WEEKS, YEARS	removed: CACHE added: EXCEPT, EXTRACT, FLOOR, GROUPING, INTEGER, MINUS, PRECISION
Hive 2.3.0	removed: MERGE added: DETAIL, EXPRESSION, OPERATOR, SUMMARY, VECTORIZATION, WAIT	added: MERGE
Hive 3.0.0	removed: PRETTY added: ACTIVATE, ACTIVE, ALLOC_FRACTION, CHECK, DEFAULT, DO, ENFORCED, KILL, MANAGEMENT, MAPPING, MOVE, PATH, PLAN, PLANS, POOL, QUERY, QUERY_PARALLELISM, REOPTIMIZATION, RESOURCE, SCHEDULING_POLICY, UNMANAGED, WORKLOAD, ZONE	removed: PARTIALSCAN added: ANY, APPLICATION, DEC, NUMERIC, SYNC, TIME, TIMESTAMPLOCALTZ, UNIQUE
Hive 3.1.0	N/A	N/A
Hive 4.0.0	added: AST, AT, BRANCH, CBO, COST, CRON, DCPROPERTIES, DEBUG, DISABLED, DISTRIBUTED, ENABLED, EVERY, EXECUTE, EXECUTED, EXPIRE_SNAPSHOTS, IGNORE, JOINCOST, MANAGED, MANAGEDLOCATION, OPTIMIZE, REMOTE, RESPECT, RETAIN, RETENTION, SCHEDULED, SET_CURRENT_SNAPSHOT, SNAPSHOTS, SPEC, SYSTEM_TIME, SYSTEM_VERSION, TAG, TRANSACTIONAL, TRIM, TYPE, UNKNOWN, URL, WITHIN	added: COMPACTIONID, CONNECTOR, CONNECTORS, CONVERT, DDL, FORCE, LEADING, OLDER, PKFK_JOIN, PREPARE, QUALIFY, REAL, SOME, THAN, TRAILING
2023.1	added: TIME	removed: TIME

Accessing Hive 4

Specifying Hive 4 on the TD Console

Specifying Hive 4 using the TD Toolbelt

Specifying Hive 4 with the Python Client

New SQL Function support

Quote(str) function

JSON_READ(json,type) function

Sinh, cosh, tanh functions

Tumbling_window(timestamp)

CAST + FORMAT

Array type functions

percentile_cont, percentile_disc

approx_distinct

DataSketches functions

typeof

get_sql_schema

Hive 4 Limitations and Differences

ORDER BY Sorts Nulls Last

The mask_hash UDF Algorithm Has Changed.

Time Index Log Messages Have Changed

New Property to Disable Sorting of Subqueries

"True" and "False" Strings Resolve Differently

TIMESTAMP WITH LOCAL TIME ZONE is Supported

Cast NULL Literals in Common Table Expressions

Stateful UDFs Can Only Be Invoked in the SELECT List

Aliasing HAVING with CTE

Nested Parentheses in a Sub-query Don't Parse

YEAR UDF cannot be a numeric type

TD UDFs with a database name

Reserved Keywords of 2023.1

`ORDER BY` Sorts Nulls Last

The `mask_hash` UDF Algorithm Has Changed.

`TIMESTAMP WITH LOCAL TIME ZONE` is Supported

Aliasing `HAVING` with CTE