Writing Hive Queries

The following information can help in writing Hive queries.

Hive SELECT Statement Syntax

Here’s the syntax of Hive’s SELECT statement.

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
;

SELECT is the projection operator in HiveQL. Note the following:

  • SELECT scans the table specified by the FROM clause.
  • WHERE gives the condition specifying which records to filter.
  • GROUP BY gives a list of columns specifying how to aggregate the records.
  • CLUSTER BY, DISTRIBUTE BY, and SORT BY specify the sort order and algorithm.
  • LIMIT specifies how many records to retrieve.
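
As a sketch of the sort-related clauses, reusing the products table and its category and sales_cost columns from the examples below:

-- Send rows with the same category to the same reducer,
-- then sort within each reducer
SELECT category, sales_cost FROM products
  DISTRIBUTE BY category SORT BY category, sales_cost;

-- CLUSTER BY col is shorthand for DISTRIBUTE BY col SORT BY col
SELECT category, sales_cost FROM products CLUSTER BY category;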

Computing with Columns

When you select columns, you can manipulate column values using arithmetic operators or function calls. Math, date, and string functions are commonly used.

Here’s an example query that uses both operators and functions.

SELECT upper(name), sales_cost FROM products;
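
As another sketch, arithmetic and function calls can be combined in the same select list (the 10% markup is illustrative, and sales_cost is assumed to be numeric):

SELECT upper(name), sales_cost * 1.1 AS cost_with_markup FROM products;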

WHERE Clauses

A WHERE clause is used to filter the result set by using predicate operators and logical operators. Functions can also be used to compute the condition.

Here’s an example query that uses a WHERE clause.

SELECT name FROM products WHERE name = 'stone of jordan';
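
A sketch combining a function call with logical operators in the predicate (sales_cost is taken from the earlier example; the threshold is illustrative):

SELECT name FROM products
  WHERE lower(name) = 'stone of jordan' AND sales_cost > 100;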

GROUP BY Clauses

A GROUP BY clause is frequently used with aggregate functions, to group the result set by columns and apply aggregate functions over each group. Functions can also be used to compute the grouping key.

Here’s an example query that groups and counts by category.

SELECT category, count(1) FROM products GROUP BY category;
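
The grouping key can itself be a computed expression. For example, assuming category values vary in letter case:

SELECT lower(category), count(1) FROM products GROUP BY lower(category);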

HAVING Clauses

A HAVING clause lets you filter the groups produced by GROUP BY, by applying predicate operators to each group.

Here’s an example query that groups and counts by category, and then retrieves only the groups with a count greater than 10.

SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING cnt > 10;

Example Queries

Here are some basic examples. The underlying table consists of three fields: ip, url, and time.

-- Number of Records
SELECT COUNT(1) FROM www_access;

-- Number of Unique IPs
SELECT COUNT(1) FROM (
  SELECT DISTINCT ip FROM www_access
) t;

-- Number of Unique IPs that Accessed the Top Page
SELECT COUNT(DISTINCT ip) FROM www_access
  WHERE url='/';

-- Number of Accesses per Unique IP
SELECT ip, COUNT(1) FROM www_access
  GROUP BY ip LIMIT 30;

-- Unique IPs Sorted by Number of Accesses
SELECT ip, COUNT(1) AS cnt FROM www_access
  GROUP BY ip
  ORDER BY cnt DESC LIMIT 30;

-- Number of Accesses After a Certain Time
SELECT COUNT(1) FROM www_access
  WHERE TD_TIME_RANGE(time, "2011-08-19", NULL, "PDT");

-- Number of Accesses Each Day
SELECT
  TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT") AS day,
  COUNT(1) AS cnt
FROM www_access
GROUP BY TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT");

The TD_TIME_RANGE UDF is efficient to use. See Using TD_TIME_RANGE.

info

The time column is a special column that is always present and stores the UNIX timestamp of the log.

INSERT Statement Syntax

Here’s the syntax of Hive’s INSERT statement.

-- append new rows to tablename1
INSERT INTO TABLE tablename1 select_statement1 FROM from_statement; 

-- replace contents of tablename1
INSERT OVERWRITE TABLE tablename1 select_statement1 FROM from_statement; 

-- more complex example using WITH clause
WITH tablename1 AS (select_statement1 FROM from_statement) INSERT [OVERWRITE/INTO] TABLE tablename2 select_statement2 FROM tablename1;

  • INSERT INTO will append to the table, keeping the existing data intact.
  • If the record doesn’t include a time column, the time column is populated using TD_SCHEDULED_TIME().
  • If the record includes a time column, its value should be a UNIX timestamp.
  • All INSERT INTO statements require the FROM clause.

Restrictions

  • All column aliases used in INSERT...SELECT statements should be valid SQL column names to avoid failures when setting the schema.
  • Treasure Data HiveQL does not support Hive Multi-Table Inserts.
  • On tables NOT receiving streaming updates, INSERT OVERWRITE will delete any existing data in the table and write the new rows.
  • INSERT OVERWRITE is not recommended for use on tables receiving streaming updates. INSERT OVERWRITE will not delete recently received streaming update rows or updates that arrive during the execution of INSERT OVERWRITE. If you need to perform INSERT OVERWRITE on a table that normally receives streaming updates, stop the streaming update before performing INSERT OVERWRITE.

Hive INSERT Syntax in Treasure Data

The Hive INSERT syntax conforms to ANSI SQL standards.

Restrictions

  • Partitioning and INSERT: Because TD implements its own approaches to partitioning data (time-based and user-defined partitioning), the standard Hive INSERT INTO... PARTITION clause is not supported.

Syntax

INSERT OVERWRITE TABLE tablename1 select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 select_statement1 FROM from_statement;

Examples

Append new rows to table_b.

INSERT INTO TABLE table_b SELECT col1, col2 FROM table_a; 

Replace the contents of table_b.

INSERT OVERWRITE TABLE table_b  SELECT col1, col2 FROM table_a; 

Example using WITH clause:

WITH table_a_tmp AS
(
SELECT col1, col2, 'test1' as col3
  FROM table_a
)
INSERT [OVERWRITE/INTO] TABLE table_b
SELECT col1, col2, col3
  FROM table_a_tmp;

Native UDFs on Hive

All native UDFs on Hive can also be used on Treasure Data.

Nested JSON records can be parsed using Hive’s get_json_object UDF.

However, Treasure Data recommends that you maintain a flat JSON structure to avoid additional CPU performance overhead.
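
As a minimal sketch, assuming a hypothetical table events with a JSON string column payload containing values such as {"user":{"id":123}}:

-- Extract a nested field with a JSONPath expression (returns a string)
SELECT get_json_object(payload, '$.user.id') AS user_id FROM events;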

Compute Column Values with TD Hive Reference Links

SQL lets you compute new values using arithmetic operators or function calls. For more details, see Hive Arithmetic Operators or Hive Mathematical Functions.
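
For instance, an arithmetic operator can be combined with a math function (the 10% discount is illustrative, reusing the products table from the examples above):

SELECT name, round(sales_cost * 0.9, 2) AS discounted_cost FROM products;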

Hive Built-in Operators

Built-in operators include:

  • Operators Precedences
  • Relational Operators
  • Arithmetic Operators
  • Logical Operators
  • String Operators
  • Complex Type Constructors
  • Operators on Complex Types

For detailed information on these operators see Hive Operators.

Hive Reserved and Non-Reserved Keywords

For a list of reserved and non-reserved keywords see Keywords