Accessing Hive 4

Support for Hive 4 is now available on the Treasure Data console.

Specifying Hive 4 on the TD Console

After your account has been enabled for Hive 4 functionality, the drop-down menu for Type in the Queries window will include Hive 2023.1 (experimental). Selecting this option will cause TD to process your query with Hive 4.

image

Specifying Hive 4 using the TD Toolbelt

Using TD Toolbelt, you can specify a Hive 4 query by including the command line option
-T hive --engine-version experimental.

For example, a Hive 4 query would look something like this:

Copy
Copied
td query -w -d sample_datasets "SELECT COUNT(1) FROM www_access" -T hive --engine-version experimental

Specifying Hive 4 with the Python Client

Using the Python Client you can specify a Hive 4 query by specifying the engine_version as experimental. For example you can schedule a job using Hive 4 with the following Python code:

QueryCreate Schedule
Copy
Copied
import tdclient

with tdclient.Client() as td:
    job = td.query(
        "sample_datasets",
        "SELECT COUNT(1) FROM www_access",
        type="hive",
        engine_version="experimental",
    )

    job.wait()
    for row in job.result():
        print(repr(row))
Copy
Copied
import tdclient

with tdclient.Client() as td:
    td.create_schedule(
        "sched-example",
        {
            "type": "hive",
            "engine_version": "experimental",
            "query": "select count(1) from www_access",
            "database": "sample_datasets",
            "cron": "0 13 * * *",
        },
    )

New SQL Function support

TD has added support for the following SQL features in Hive 4:

Quote(str) function

Quotes a string to produce a result that can be used as a properly escaped data value in an SQL statement. The string is returned enclosed by single quotation marks and with each instance of backslash, single quote ('), ASCII NUL, or Control+Z preceded by a backslash.

If the argument is NULL, the return value is the word “NULL” without enclosing single quotation marks.

Copy
Copied
SELECT str, quote(str) FROM test_quote;

+-------------------+----------------------+
|        str        |         _c1          |
+-------------------+----------------------+
| Don't say "lazy"  | 'Don\'t say "lazy"'  |
+-------------------+----------------------+

JSON_READ(json,type) function

JSON_READ parses the given json according to the given complex type specification.

Copy
Copied
WITH parsed AS (
  SELECT json_read('[{"user_id": 1, "item_ids": [1, 5]}, {"user_id": 2, "item_ids": [2, 8]}]', 'array<struct<user_id:int,item_ids:array<int>>>') AS users
)
SELECT
  users[0].user_id AS user_id_0, users[0].item_ids AS item_ids_0,
  users[1].user_id AS user_id_0, users[1].item_ids AS item_ids_0,
  users
FROM parsed;

+------------+-------------+--------------+---------------+----------------------------------------------------+
| user_id_0  | item_ids_0  | user_id_0_1  | item_ids_0_1  |                       users                        |
+------------+-------------+--------------+---------------+----------------------------------------------------+
| 1          | [1,5]       | 2            | [2,8]         | [{"user_id":1,"item_ids":[1,5]},{"user_id":2,"item_ids":[2,8]}] |
+------------+-------------+--------------+---------------+----------------------------------------------------+

Sinh, cosh, tanh functions

The following hyperbolic built-in functions for MATH usage have been added: sinh, cosh, tanh.

Tumbling_window(timestamp)

This function rounds timestamp agr1 to the beginning of window interval where it belongs to.

Copy
Copied
SELECT tumbling_window(cast('2024-05-12 15:23:59' as timestamp), interval '2' MINUTES);

+------------------------+
|          _c0           |
+------------------------+
| 2024-05-12 15:22:00.0  |
+------------------------+

CAST + FORMAT

Introduced the FORMAT clause to CAST statements as well as to SQL:2016 datetime formats. Here is a limited list of datetime formats supported: YYYY, MM, DD, HH, MI, SS, YYY, YY, Y, RRRR, RR, DDD, HH12, HH24, SSSSS, FF[1..9], AM/A.M., PM/P.M., TZH, TZM.

Copy
Copied
SELECT
  cast(cast('2024-05-12 15:23:59' as timestamp) as string format 'yyyy-MM-dd'),
  cast('2024/05/12 15:23:59' as timestamp format 'yyyy/MM/dd HH24:MI:ss');

+-------------+------------------------+
|     _c0     |          _c1           |
+-------------+------------------------+
| 2024-05-12  | 2024-05-12 15:23:59.0  |
+-------------+------------------------+

Array type functions

  • array_min(array(obj1, obj2,...)): Returns the minimum value in an array with elements for which order is supported, returns null if array is empty
  • array_max(array(obj1, obj2,...)): Returns the maximum value in an array with elements for which order is supported, returns null if array is empty
  • array_distinct(array(obj1, obj2,...)): Returns an array of the same type as the input argument where all duplicate values have been removed.
  • array_join(array, delimiter, replaceNull): Concatenate the elements of an array with a specified delimiter Example:
  • array_slice(array, start, length): Returns the subset or range of elements from an array (subarray)
  • array_except(array1, array2): Returns an array of the elements in array1 but not in array2, without duplicates.
  • array_intersect(array1, array2): Returns an array of the elements in the intersection of array1 and array2, without duplicates.
  • array_union(array1, array2): Returns an array of the elements in the union of array1 and array2 without duplicates.
  • array_remove(array, element): Removes all occurrences of element from array.

percentile_cont, percentile_disc

Copy
Copied
SELECT
  percentile_cont(ws_sales_price, 0.5),
  percentile_disc(ws_sales_price, 0.5)
FROM web_sales;

approx_distinct

DataSketches functions

If approximate results are acceptable, there is a class of specialized algorithms called streaming algorithms, or sketches, that can produce results orders of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of real-time analysis, sketches are the only known solution.

Hive4 integrate with Apache Datasketches library more closely

  • HyperLogLog
    • ds _ hll _ estimate
    • ds _ hll _ estimate _ bounds
    • ds _ hll _ sketch
    • ds _ hll _ stringify
    • ds _ hll _ union
    • ds _ hll _ union _ f
  • Theta Sketch
    • ds _ theta _ estimate
    • ds _ theta _ exclude
    • ds _ theta _ intersect
    • ds _ theta _ intersect _ f
    • ds _ theta _ sketch
    • ds _ theta _ union
    • ds _ theta _ union _ f
  • Tuple Sketch
    • ds _ tuple _ arrayofdouble _ estimate
    • ds _ tuple _ arrayofdouble _ estimate _ bounds
    • ds _ tuple _ arrayofdouble _ means
  • etc...

typeof

This function returns a string describing the type of the argument.

Copy
Copied
SELECT typeof('text'), typeof(1), typeof(CAST(NULL AS bigint)), typeof(array(1.0));

+---------+------+---------+----------------------+
|   _c0   | _c1  |   _c2   |         _c3          |
+---------+------+---------+----------------------+
| string  | int  | bigint  | array<decimal(1,0)>  |
+---------+------+---------+----------------------+

get_sql_schema

This function helps to retrieve each column name and it's data type.

Copy
Copied
SELECT get_sql_schema('select ws_item_sk, max(ws_quantity) as max_quantity from web_sales group by ws_item_sk');

+---------------+-----------+
|   col_name    | col_type  |
+---------------+-----------+
| ws_item_sk    | bigint    |
| max_quantity  | int       |
+---------------+-----------+

Hive 4 Limitations and Differences

The following limitations or differences exist between the stable version of Hive 2 (Hive 2020.1) and Hive 4.

ORDER BY Sorts Nulls Last

While not necessarily a limitation, for ORDER_BY, the 2023.1 default ascending sort order (ASC) is NULLS LAST, and the default descending sort order (DESC) is NULLS FIRST.

Copy
Copied
ORDER BY name;

null            aaa
aaa      ->     bbb
bbb             null

And, to use traditional sorting you can use the following keyword:

Copy
Copied
ORDER BY name NULLS FIRST;

or

Copy
Copied
ORDER BY name NULLS LAST;

The mask_hash UDF Algorithm Has Changed.

The default mask has changed from MD5 to SHA256.

Time Index Log Messages Have Changed

In Hive 4, you will see a change in the number of log messages related to time index filtering in output log. These messages will look something like this:

Copy
Copied
** WARNING: time index filtering is not set on suprith_test.million_partitions!
** This query could be very slow as a result.
** If you used 'unix_timestamp' please modify your query to use TD_SCHEDULED_TIME instead
**   or rewrite the condition using TD_TIME_RANGE
** Please see https://docs.treasuredata.com/display/public/PD/Hive+Performance+Tuning#HivePerformanceTuning-LeveragingTime-basedPartitioning

New Property to Disable Sorting of Subqueries

Hive3 added the hive.remove.orderby.in.subquery property to disable sorting in subqueries (see HIVE-6348.This feature continues in Hive 4. Functions that rely on UDF sorting, such as TD_SESSIONIZE are not affected by this change.

"True" and "False" Strings Resolve Differently

The strings "true" and "false" can now be evaluated as TRUE/FALSE using the AbstractGenericUDAFResolver. However, strings such as "aaa" still return true, but functions like TD_AVGIF and TD_SUMIF will return different results given “false” as an argument.

Hive 2 TD_SUMIFHive 4 TD_SUMIF
Copy
Copied
SELECT
  TD_SUMIF(1, "true"),   --> 1
  TD_SUMIF(1, "false"),  --> 1
  TD_SUMIF(1, "aaa"),    --> 1
Copy
Copied
SELECT
  TD_SUMIF(1, "true"),   --> 1
  TD_SUMIF(1, "false"),  --> null
  TD_SUMIF(1, "aaa"),    --> 1

TIMESTAMP WITH LOCAL TIME ZONE is Supported

While this feature is support, be aware that it is currently impossible to change the default time zone used during the execution of the query. Therefore, the “local” time zone used will always be GMT+0 (UTC).

Cast NULL Literals in Common Table Expressions

In some cases, common table expressions (CTEs) can fail if you use a null literal in a SELECT statement. For example, here is CTE that fails:

Copy
Copied
WITH x AS (SELECT null AS mofu, 'POST' AS method)
SELECT t1.method
FROM sample_datasets.www_access t1
WHERE EXISTS (SELECT 1 FROM x t2 WHERE t1.method=t2.method)
UNION ALL
SELECT x.method FROM x

ql.Driver: FAILED: SemanticException [Error 10305]: CREATE-TABLE-AS-SELECT creates a VOID type, please use CAST to specify the type, near field:  mofu
org.apache.hadoop.hive.ql.parse.SemanticException: CREATE-TABLE-AS-SELECT creates a VOID type, please use CAST to specify the type, near field:  mofu

You should explicitly cast the null literal to the expected type. For example, in this CTE null is cast to type string:

Copy
Copied
WITH x AS (SELECT CAST(null AS STRING) AS mofu, 'POST' AS method)

Stateful UDFs Can Only Be Invoked in the SELECT List

Stateful UDFs such as TD_X_RANK cannot be used anywhere other than in the SELECT list. For example, specifying TD_X_RANK in the HAVING clause causes this query to fail:

Copy
Copied
SELECT col1, col2 FROM table
HAVING TD_X_RANK(col2) < 10

FAILED: SemanticException [Error 10084]: Stateful UDF's can only be invoked in the SELECT list
org.apache.hadoop.hive.ql.parse.SemanticException: Stateful UDF's can only be invoked in the SELECT list

To fix this you use a window function with a QUALIFY filter.

Window function in the select list with a QUALIFY clauseWindow function in the QUALIFY clause
Copy
Copied
SELECT
  col1, col2,
  ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2 DESC) AS rank
FROM table
QUALIFY
  rank < 10
Copy
Copied
SELECT
  col1, col2,
FROM table
QUALIFY
  ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2 DESC) < 10

Aliasing HAVING with CTE

In some cases, due to query compilation optimizations introduced in Hive 2023.1, HAVING clauses that reference aliases inside CTEs no longer work.

Copy
Copied
WITH x (
  SELECT col1, COUNT(*) AS num FROM table1
  GROUP BY 1
  HAVING NUM > 10
)
SELECT * FROM table2 WHERE col1 IN (SELECT col1 FROM x)

FAILED: CalciteSubquerySemanticException Encountered Select alias 'num' in having clause 'num > 10' This non standard behavior is not supported with cbo on. Turn off cbo for these queries.

Rewrite your CTE so the HAVING clause does not reference an alias.

Copy
Copied
WITH x (
  SELECT col1, COUNT(*) AS num FROM table1
  GROUP BY 1
  HAVING COUNT(*) > 10
)
SELECT * FROM table2 WHERE col1 IN (SELECT col1 FROM x)

Nested Parentheses in a Sub-query Don't Parse

Nested parentheses within subqueries are no longer allowed in Hive 2023.1.

Copy
Copied
SELECT col1, AVG(col2)
FROM (
  (
    SELECT col1, col2 FROM table WHERE time = 252691200
  )
) sub
GROUP BY col1

org.apache.hadoop.hive.ql.parse.ParseException: line 6:0 cannot recognize input near ')' 'sub' 'GROUP' in subquery source

Rewrite your subquery without nesting parentheses.

Copy
Copied
SELECT col1, AVG(col2)
FROM (
  SELECT col1, col2 FROM table WHERE time = 252691200
) sub
GROUP BY col1

YEAR UDF cannot be a numeric type

2023.1 throws a syntax error when you specify a numeric value for YEAR. The same applies to other UDFs(e.g. MONTH, DAY, HOUR, MINUTE, SECOND and WEEKOFYEAR) that handle dates. The value you specify for these UDFs must be a timestamp string.

Copy
Copied
SELECT YEAR(2024)

FAILED: SemanticException [Error 10016]: Line 1:12 Argument type mismatch '2024': year does not take INT type

To resolve this issue specify a valid timestamp string.

Copy
Copied
SELECT YEAR('2024-01-01')

TD UDFs with a database name

{any db name}.TD_••• succeeds with 2020.1, but fails with 2023.1.

Here is an example of how TD_URL_DECODE with a database name fails in 2023.1:

Copy
Copied
SELECT sample_datasets.TD_URL_DECODE(path) FROM sample_datasets.www_access

FAILED: SemanticException [Error 10011]: Invalid function sample_datasets.TD_URL_DECODE

You can avoid the problem by removing the database name from the UDF identifier.

Copy
Copied
SELECT TD_URL_DECODE(path) FROM sample_datasets.www_access

Reserved Keywords of 2023.1

Hive 2023.1 introduced the following reserved keywords. If you use any of these keywords as table names or column names, you will need to quote them with backticks.

  • ANY
  • APPLICATION
  • COMPACTIONID
  • CONNECTORa
  • CONNECTORS
  • CONVERT
  • DDL
  • DEC
  • FORCE
  • LEADING
  • NUMERIC
  • OLDER
  • PKFK _ JOIN
  • PREPARE
  • QUALIFY
  • REAL
  • SOME
  • SYNC
  • TIMESTAMPLOCALTZ
  • THAN
  • TRAILING
  • UNIQUE

The keywords for Hive 2023.1 are almost identical to the Keywords for Hive 4.0.0 except for TIME which is non-reserved in the Treasure Data environment.

Version Non-reserved Keywords Reserved Keywords
Hive 1.2.0 ADD, ADMIN, AFTER, ANALYZE, ARCHIVE, ASC, BEFORE, BUCKET, BUCKETS, CASCADE, CHANGE, CLUSTER, CLUSTERED, CLUSTERSTATUS, COLLECTION, COLUMNS, COMMENT, COMPACT, COMPACTIONS, COMPUTE, CONCATENATE, CONTINUE, DATA, DATABASES, DATETIME, DAY, DBPROPERTIES, DEFERRED, DEFINED, DELIMITED, DEPENDENCY, DESC, DIRECTORIES, DIRECTORY, DISABLE, DISTRIBUTE, ENABLE, ESCAPED, EXCLUSIVE, EXPLAIN, EXPORT, FIELDS, FILE, FILEFORMAT, FIRST, FORMAT, FORMATTED, FUNCTIONS, HOLD_DDLTIME, HOUR, IDXPROPERTIES, IGNORE, INDEX, INDEXES, INPATH, INPUTDRIVER, INPUTFORMAT, ITEMS, JAR, KEYS, LIMIT, LINES, LOAD, LOCATION, LOCK, LOCKS, LOGICAL, LONG, MAPJOIN, MATERIALIZED, METADATA, MINUS, MINUTE, MONTH, MSCK, NOSCAN, NO_DROP, OFFLINE, OPTION, OUTPUTDRIVER, OUTPUTFORMAT, OVERWRITE, OWNER, PARTITIONED, PARTITIONS, PLUS, PRETTY, PRINCIPALS, PROTECTION, PURGE, READ, READONLY, REBUILD, RECORDREADER, RECORDWRITER, REGEXP, RELOAD, RENAME, REPAIR, REPLACE, REPLICATION, RESTRICT, REWRITE, RLIKE, ROLE, ROLES, SCHEMA, SCHEMAS, SECOND, SEMI, SERDE, SERDEPROPERTIES, SERVER, SETS, SHARED, SHOW, SHOW_DATABASE, SKEWED, SORT, SORTED, SSL, STATISTICS, STORED, STREAMTABLE, STRING, STRUCT, TABLES, TBLPROPERTIES, TEMPORARY, TERMINATED, TINYINT, TOUCH, TRANSACTIONS, UNARCHIVE, UNDO, UNIONTYPE, UNLOCK, UNSET, UNSIGNED, URI, USE, UTC, VIEW, WHILE, YEAR ALL, ALTER, AND, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH, BY, CASE, CAST, CHAR, COLUMN, CONF, CREATE, CROSS, CUBE, CURRENT, CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATABASE, DATE, DECIMAL, DELETE, DESCRIBE, DISTINCT, DOUBLE, DROP, ELSE, END, EXCHANGE, EXISTS, EXTENDED, EXTERNAL, FALSE, FETCH, FLOAT, FOLLOWING, FOR, FROM, FULL, FUNCTION, GRANT, GROUP, GROUPING, HAVING, IF, IMPORT, IN, INNER, INSERT, INT, INTERSECT, INTERVAL, INTO, IS, JOIN, LATERAL, LEFT, LESS, LIKE, LOCAL, MACRO, MAP, MORE, NONE, NOT, NULL, OF, ON, OR, ORDER, OUT, OUTER, OVER, PARTIALSCAN, PARTITION, PERCENT, PRECEDING, PRESERVE, PROCEDURE, RANGE, READS, REDUCE, REVOKE, RIGHT, ROLLUP, ROW, ROWS, SELECT, SET, SMALLINT, TABLE, TABLESAMPLE, THEN, TIMESTAMP, TO, TRANSFORM, TRIGGER, TRUE, TRUNCATE, UNBOUNDED, UNION, UNIQUEJOIN, UPDATE, USER, USING, UTC_TMESTAMP, VALUES, VARCHAR, WHEN, WHERE, WINDOW, WITH
Hive 2.0.0 removed: HOLD_DDLTIME, IGNORE, NO_DROP, OFFLINE, PROTECTION, READONLY, REGEXP, RLIKE
added: AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE
added: COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START
Hive 2.1.0 added: ABORT, KEY, LAST, NORELY, NOVALIDATE, NULLS, RELY, VALIDATE added: CACHE, CONSTRAINT, FOREIGN, PRIMARY, REFERENCES
Hive 2.2.0 removed: MINUS
added: CACHE, DAYS, DAYOFWEEK, DUMP, HOURS, MATCHED, MERGE, MINUTES, MONTHS, QUARTER, REPL, SECONDS, STATUS, VIEWS, WEEK, WEEKS, YEARS
removed: CACHE
added: EXCEPT, EXTRACT, FLOOR, GROUPING, INTEGER, MINUS, PRECISION
Hive 2.3.0 removed: MERGE
added: DETAIL, EXPRESSION, OPERATOR, SUMMARY, VECTORIZATION, WAIT
added: MERGE
Hive 3.0.0 removed: PRETTY
added: ACTIVATE, ACTIVE, ALLOC_FRACTION, CHECK, DEFAULT, DO, ENFORCED, KILL, MANAGEMENT, MAPPING, MOVE, PATH, PLAN, PLANS, POOL, QUERY, QUERY_PARALLELISM, REOPTIMIZATION, RESOURCE, SCHEDULING_POLICY, UNMANAGED, WORKLOAD, ZONE
removed: PARTIALSCAN
added: ANY, APPLICATION, DEC, NUMERIC, SYNC, TIME, TIMESTAMPLOCALTZ, UNIQUE
Hive 3.1.0 N/A N/A
Hive 4.0.0 added: AST, AT, BRANCH, CBO, COST, CRON, DCPROPERTIES, DEBUG, DISABLED, DISTRIBUTED, ENABLED, EVERY, EXECUTE, EXECUTED, EXPIRE_SNAPSHOTS, IGNORE, JOINCOST, MANAGED, MANAGEDLOCATION, OPTIMIZE, REMOTE, RESPECT, RETAIN, RETENTION, SCHEDULED, SET_CURRENT_SNAPSHOT, SNAPSHOTS, SPEC, SYSTEM_TIME, SYSTEM_VERSION, TAG, TRANSACTIONAL, TRIM, TYPE, UNKNOWN, URL, WITHIN added: COMPACTIONID, CONNECTOR, CONNECTORS, CONVERT, DDL, FORCE, LEADING, OLDER, PKFK_JOIN, PREPARE, QUALIFY, REAL, SOME, THAN, TRAILING
2023.1 added: TIME removed: TIME