Must-Know Techniques for Handling Big Data in Hive

In most tech companies, data teams must be able to manage and process Big Data at scale, so familiarity with the Hadoop ecosystem is essential. Hive Query Language (HQL), part of the Apache Hive project, is a powerful tool for data professionals to manipulate, query, transform, and analyze data within this ecosystem.
HQL offers a SQL-like interface, making data processing in Hadoop accessible and user-friendly for a broad range of users. If you're already proficient in SQL, you'll likely find the transition to HQL straightforward. However, HQL includes quite a few functions and features that aren't available in standard SQL. In this article, drawing on my own experience, I'll explore some of these key HQL functions and features that require knowledge beyond SQL. Understanding and utilizing these capabilities is critical for anyone working with Hive and big data, as they form the backbone of scalable, efficient data processing pipelines and analytics systems in the Hadoop ecosystem. To illustrate each concept, I'll provide use cases with mock data and the corresponding HQL syntax to demonstrate how these features can be applied effectively.
PARTITIONED BY
Let's imagine an e-commerce company that needs to update its sales data on a daily basis. If all the data is stored in a single dataset without partitioning, the table can become extremely large over a few months, significantly reducing query performance.
Apache Hive uses the PARTITIONED BY clause to create partitioned tables, allowing queries to skip irrelevant subsets of the data and improving efficiency.
In this example, I first create a table named ecom_sales, which includes columns for SKU_id (the unique identifier for each SKU), units_sold (the number of units sold for a specific SKU), and price_USD (the price per unit). The table also includes a partitioning column, ymd. Next, I insert values into the partition where ymd = 20240801. Finally, I query all data from this partition.
-- Create a partitioned table for sales
CREATE TABLE IF NOT EXISTS ecom_sales (
    SKU_id STRING,
    units_sold INT,
    price_USD DOUBLE
)
PARTITIONED BY (ymd INT)
STORED AS PARQUET;

-- Insert data into the partitioned table dated August 1st, 2024
INSERT INTO ecom_sales PARTITION (ymd=20240801)
VALUES
    ('S0002D', 5, 15.99),
    ('S0002D', 6, 15.99),
    ('S0001D', 12, 7.99),
    ('S0003D', 10, 21.50),
    ('S0001D', 9, 7.99),
    ('S0001D', 10, 7.99);

-- Query all data from the partitioned table dated August 1st, 2024
SELECT * FROM ecom_sales
WHERE ymd = 20240801;
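The insert above targets a single, statically named partition. Hive can also route rows into partitions dynamically based on column values. Here is a minimal sketch, assuming a hypothetical staging table raw_sales that carries the same columns plus ymd:

-- Enable dynamic partitioning ("nonstrict" allows every partition to be dynamic)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- raw_sales is a hypothetical staging table; the final SELECT column (ymd)
-- determines the partition each row lands in
INSERT INTO ecom_sales PARTITION (ymd)
SELECT SKU_id, units_sold, price_USD, ymd
FROM raw_sales;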
Partitioning in Hive undoubtedly leads to faster queries by avoiding full table scans. It also makes it easier to locate and modify specific records. With partitioning, you can prevent concurrency issues by running operations (such as UPDATE and DELETE) on non-overlapping partitions.
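Partition-level maintenance is equally straightforward. As a minimal sketch (using a hypothetical older date), dropping an expired daily partition removes its data without scanning or rewriting the rest of the table:

-- Drop a single day's partition; other partitions are untouched
ALTER TABLE ecom_sales DROP IF EXISTS PARTITION (ymd=20240701);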
However, it's important to be cautious about small file problems. It's better to create 100 partitions with 1 GB each rather than 10,000 partitions with 0.01 GB each. Small files can cause several issues:
- Reduced Query Performance: A large number of partitions increases the load on the Hive metastore due to the excessive metadata it must track, which slows down query planning and execution. Hadoop's distributed processing is optimized for large files, and many small files lead to excessive disk I/O, network overhead, and unnecessary MapReduce tasks.
- Wasted NameNode Memory: In HDFS, the NameNode holds metadata in memory for every file and block, regardless of size, so millions of small files consume memory out of all proportion to the data they actually hold.
- Management Challenges: Managing large numbers of small files complicates tasks like optimization, data consolidation, compaction, and rebalancing, making maintenance more difficult.
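One common mitigation, sketched below on the assumption that MapReduce is the execution engine, is to let Hive merge small output files at the end of a job:

-- Merge small files produced by map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- If the average output file is smaller than ~128 MB, run an extra merge step
SET hive.merge.smallfiles.avgsize=134217728;
-- Target size for the merged files (~256 MB)
SET hive.merge.size.per.task=268435456;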
STORED AS
Hive uses the STORED AS clause to specify the file format for table storage. The choice of file format can significantly impact query performance and storage efficiency.
In the previous section on the PARTITIONED BY clause, I created a partitioned table named ecom_sales and stored it in the PARQUET format.
-- Create a partitioned table for sales with file format PARQUET
CREATE TABLE IF NOT EXISTS ecom_sales (
    SKU_id STRING,
    units_sold INT,
    price_USD DOUBLE
)
PARTITIONED BY (ymd INT)
STORED AS PARQUET;
In addition to PARQUET, Hive supports several other file formats:
- PARQUET: A columnar storage format optimized for analytical workloads and complex queries.
- ORC (Optimized Row Columnar): A columnar format offering high compression and fast reads; it is also the storage format used by Hive's ACID transactional tables.
- AVRO: A row-based format with strong support for schema evolution, commonly used for data interchange and streaming pipelines.
- TEXTFILE: A format for plain text, often used for CSV or TSV files. It is the default format if none is specified.
- JSONFILE: Used for storing data in JSON format.
Beyond these five commonly used formats, Hive also supports others, such as SEQUENCEFILE (a flat file of binary key/value pairs) and RCFILE (Record Columnar File), as well as custom formats defined via INPUTFORMAT and OUTPUTFORMAT. Compression can additionally be configured for many of these formats.
When creating a new table in Hive, it's essential to choose the appropriate format(s) to ensure efficient data ingestion, storage, querying, and processing.
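To illustrate, here is a minimal sketch of the same sales table stored as ORC instead (ecom_sales_orc is an illustrative name), with ZLIB compression requested through a table property:

-- The same schema stored as ORC with ZLIB compression
CREATE TABLE IF NOT EXISTS ecom_sales_orc (
    SKU_id STRING,
    units_sold INT,
    price_USD DOUBLE
)
PARTITIONED BY (ymd INT)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');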
DISTRIBUTE BY / CLUSTER BY
Hive employs the DISTRIBUTE BY clause to control how rows are distributed across reducers.
SET mapreduce.job.reduces=2;

SELECT * FROM ecom_sales
WHERE ymd = 20240801
DISTRIBUTE BY SKU_id;
In the example above, I set the number of reducers to 2. After applying the DISTRIBUTE BY clause, all rows with the same SKU_id will be processed by the same reducer, while different SKU_id values may be sent to different reducers. For instance, rows with SKU_id 'S0001D' will go to one reducer, 'S0002D' to another, and 'S0003D' to yet another. Whether S0001D and S0002D end up on the same or different reducers depends on the hashing function employed by Hive.
You can also add a SORT BY clause, which sorts the data within each reducer.
SET mapreduce.job.reduces=2;
SELECT * FROM ecom_sales
WHERE ymd = 20240801
DISTRIBUTE BY SKU_id
SORT BY SKU_id;
The result: within each reducer, the rows are now sorted by SKU_id, so all rows for a given SKU appear together and in order.
Alternatively, you can use CLUSTER BY to combine the functionality of both DISTRIBUTE BY and SORT BY, producing the same output.
SET mapreduce.job.reduces=2;
SELECT * FROM ecom_sales
WHERE ymd = 20240801
CLUSTER BY SKU_id;
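It's worth contrasting these with ORDER BY, which produces one global ordering by funneling every row through a single reducer and can therefore become a bottleneck on large datasets:

-- ORDER BY enforces a single total order through one reducer
SELECT * FROM ecom_sales
WHERE ymd = 20240801
ORDER BY SKU_id;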
LATERAL VIEW with EXPLODE
The LATERAL VIEW clause combined with the EXPLODE function is a powerful feature in Hive for handling array or map data types. It allows you to flatten nested data structures into a more traditional relational format, making it easier to analyze complex data. To demonstrate this feature, I created another table named products with the following columns: SKU_id (the unique identifier for each SKU), SKU_name (the product name associated with the SKU), and description (an array containing descriptions of the product for that SKU).
CREATE TABLE products (
    SKU_id STRING,
    SKU_name STRING,
    description ARRAY<STRING>
);

INSERT INTO products VALUES
    ('S0001D', 'white_shirt_small', ARRAY('shirt', 'white', 'small')),
    ('S0002D', 'blue_pants_large', ARRAY('pants', 'blue', 'large')),
    ('S0003D', 'yellow_jacket_large', ARRAY('jacket', 'yellow', 'large'));
SELECT * FROM products;
The syntax above creates a table that looks like:

SKU_id   SKU_name              description
S0001D   white_shirt_small     ["shirt","white","small"]
S0002D   blue_pants_large      ["pants","blue","large"]
S0003D   yellow_jacket_large   ["jacket","yellow","large"]
Next, I use the EXPLODE function to break down the array description into individual rows, where each element of the array becomes a separate row. LATERAL VIEW is then used to create a virtual table, featureTable, that joins each generated row back to its originating row in the products table.
SELECT p.SKU_id, p.SKU_name, feature
FROM products p
LATERAL VIEW EXPLODE(p.description) featureTable AS feature;
The query result will look like:

SKU_id   SKU_name              feature
S0001D   white_shirt_small     shirt
S0001D   white_shirt_small     white
S0001D   white_shirt_small     small
S0002D   blue_pants_large      pants
S0002D   blue_pants_large      blue
S0002D   blue_pants_large      large
S0003D   yellow_jacket_large   jacket
S0003D   yellow_jacket_large   yellow
S0003D   yellow_jacket_large   large
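If you also need each element's position within the array, Hive provides the POSEXPLODE function. A minimal sketch against the same products table:

-- POSEXPLODE emits the zero-based array index alongside each element
SELECT p.SKU_id, p.SKU_name, pos, feature
FROM products p
LATERAL VIEW POSEXPLODE(p.description) featureTable AS pos, feature;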
COLLECT_SET
COLLECT_SET function is an aggregation function in Hive that collects unique elements into an array, effectively performing the reverse operation of a LATERAL VIEW
combined with the EXPLODE
function. The syntax below creates a table named prod_features
, where specific elements will be grouped into arrays.
CREATE TABLE prod_features (
    SKU_id STRING,
    SKU_name STRING,
    feature STRING
);

INSERT INTO prod_features VALUES
    ('S0001D', 'white_shirt_small', 'shirt'),
    ('S0001D', 'white_shirt_small', 'white'),
    ('S0001D', 'white_shirt_small', 'small'),
    ('S0002D', 'blue_pants_large', 'pants'),
    ('S0002D', 'blue_pants_large', 'blue'),
    ('S0002D', 'blue_pants_large', 'large'),
    ('S0003D', 'yellow_jacket_large', 'jacket'),
    ('S0003D', 'yellow_jacket_large', 'yellow'),
    ('S0003D', 'yellow_jacket_large', 'large');

SELECT * FROM prod_features;

The following syntax demonstrates how the COLLECT_SET function aggregates all unique values from the feature column into an array.
SELECT SKU_id, SKU_name, COLLECT_SET(feature) AS description
FROM prod_features
GROUP BY SKU_id, SKU_name;
The query result will appear as follows (the order of elements within each array is not guaranteed):

SKU_id   SKU_name              description
S0001D   white_shirt_small     ["shirt","white","small"]
S0002D   blue_pants_large      ["pants","blue","large"]
S0003D   yellow_jacket_large   ["jacket","yellow","large"]
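Two related functions are worth knowing: COLLECT_LIST, which keeps duplicates instead of de-duplicating, and CONCAT_WS, which flattens an array into a delimited string. A minimal sketch:

-- COLLECT_LIST keeps duplicates; CONCAT_WS turns the array into a CSV string
SELECT SKU_id,
       SKU_name,
       COLLECT_LIST(feature) AS all_features,
       CONCAT_WS(',', COLLECT_SET(feature)) AS feature_csv
FROM prod_features
GROUP BY SKU_id, SKU_name;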
Conclusion
This article discussed five Hive-specific features that set Hive apart from standard SQL, making it particularly well suited to big data processing and analytics.
- PARTITIONED BY enables efficient management of large datasets by optimizing data storage and retrieval.
- STORED AS improves storage efficiency and query performance through the use of specified file formats tailored to the workload.
- DISTRIBUTE BY and CLUSTER BY give granular control over how rows are distributed across reducers during parallel processing, which is key to query performance in distributed systems.
- LATERAL VIEW with EXPLODE transforms array or map data types into individual rows, facilitating detailed analysis of nested data structures.
- COLLECT_SET aggregates individual elements into arrays, providing a powerful tool for summarizing and grouping data.
By leveraging these Hive-specific capabilities, data professionals can establish more effective and robust big data solutions, tackling challenges that would be difficult or impossible with standard SQL alone.