Must-Know Techniques for Handling Big Data in Hive


In most tech companies, data teams must be able to manage and process Big Data at scale, so familiarity with the Hadoop ecosystem is essential. Hive Query Language (HQL), the query language of Apache Hive, is a powerful tool for data professionals to manipulate, query, transform, and analyze data within this ecosystem.

HQL offers a SQL-like interface, making data processing in Hadoop accessible and user-friendly for a broad range of users. If you're already proficient in SQL, you'll likely find the transition to HQL straightforward. However, HQL includes quite a few functions and features that aren't available in standard SQL. In this article, I'll explore some of the key HQL functions and features that, in my experience, require knowledge beyond SQL. Understanding and utilizing these capabilities is critical for anyone working with Hive and big data, as they form the backbone of scalable and efficient data processing pipelines and analytics systems in the Hadoop ecosystem. To illustrate these concepts, I'll provide use cases with mock data and the corresponding HQL syntax to demonstrate how these features can be applied effectively.


PARTITIONED BY

Let's imagine an e-commerce company that needs to update its sales data on a daily basis. If all the data is stored in a single dataset without partitioning, the table can become extremely large over a few months, significantly reducing query performance.

Apache Hive uses the PARTITIONED BY clause to create partitioned tables, allowing queries to skip irrelevant subsets of the data and improving efficiency.

In this example, I first create a table named ecom_sales, which includes columns for SKU_id (the unique identifier for each SKU), unit_sold (the number of units sold for a specific SKU), and price_USD (the price per unit). The table also includes a partitioning column, ymd. Next, I insert values into the partition where ymd = 20240801. Finally, I query all data from this partition.

-- Create a partitioned table for sales
CREATE TABLE IF NOT EXISTS ecom_sales (
    SKU_id STRING,
    unit_sold INT,
    price_USD DOUBLE
)
PARTITIONED BY (ymd INT)
STORED AS PARQUET;

-- Insert data into the partitioned table dated August 1st, 2024
INSERT INTO ecom_sales PARTITION (ymd=20240801)
VALUES 
    ('S0002D', 5, 15.99),
    ('S0002D', 6, 15.99),
    ('S0001D', 12, 7.99),    
    ('S0003D', 10, 21.50),
    ('S0001D', 9, 7.99),
    ('S0001D', 10, 7.99);

-- Query all data from the partition table dated August 1st, 2024
SELECT * FROM ecom_sales 
WHERE ymd=20240801;

Partitioning in Hive undoubtedly leads to faster queries by avoiding full table scans. It also makes it easier to locate and modify specific records. With partitioning, you can prevent concurrency issues by running operations (such as UPDATE and DELETE) on non-overlapping partitions.
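
For example, partition-level maintenance can target a single day of data. Below is a minimal sketch using the ecom_sales table above; note that ecom_sales_staging is a hypothetical staging table introduced purely for illustration.

-- List the partitions currently in the table
SHOW PARTITIONS ecom_sales;

-- Rebuild one day's data without touching other partitions
-- (ecom_sales_staging is a hypothetical staging table)
INSERT OVERWRITE TABLE ecom_sales PARTITION (ymd=20240801)
SELECT SKU_id, unit_sold, price_USD
FROM ecom_sales_staging
WHERE ymd = 20240801;

-- Drop an obsolete partition entirely
ALTER TABLE ecom_sales DROP IF EXISTS PARTITION (ymd=20240701);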

However, it's important to be cautious about the small-files problem. It's better to create 100 partitions of 1 GB each than 10,000 partitions of 0.01 GB each. Small files can cause several issues (a compaction sketch follows this list):

  • Reduced Query Performance: A large number of partitions increases the load on the Hive metastore, which must handle the excessive metadata, slowing down query planning and execution. Hadoop's distributed processing is also optimized for large files, so small files lead to excessive disk I/O, network overhead, and unnecessary MapReduce tasks.

  • Wasted Metadata and Storage Capacity: In HDFS, every file and block is tracked in NameNode memory regardless of its size, and data is managed in large blocks (typically 128 MB or 256 MB). Millions of tiny files therefore exhaust metadata capacity and use the cluster inefficiently.

  • Management Challenges: Managing large numbers of small files complicates tasks like optimization, data consolidation, compaction, and rebalancing, making maintenance more difficult.
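
One common mitigation, assuming your Hive version exposes the standard merge settings, is to let Hive merge small output files and to compact an existing partition by rewriting it onto itself:

-- Encourage Hive to merge small files produced by a job (values are illustrative)
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;  -- merge outputs whose average size is below ~128 MB
SET hive.merge.size.per.task=268435456;       -- target ~256 MB per merged file

-- Compact an existing partition by rewriting it onto itself
INSERT OVERWRITE TABLE ecom_sales PARTITION (ymd=20240801)
SELECT SKU_id, unit_sold, price_USD
FROM ecom_sales
WHERE ymd = 20240801;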


STORED AS

Hive uses the STORED AS clause to specify the file format for table storage. The choice of file format can significantly impact query performance and storage efficiency.

In the previous section on the PARTITIONED BY clause, I created a partitioned table named ecom_sales and stored it in the PARQUET format.

-- Create a partitioned table for sales with file format PARQUET
CREATE TABLE IF NOT EXISTS ecom_sales (
    SKU_id STRING,
    unit_sold INT,
    price_USD DOUBLE
)
PARTITIONED BY (ymd INT)
STORED AS PARQUET;

In addition to PARQUET, Hive supports several other file formats:

  • PARQUET: A columnar storage format optimized for analytical workloads and complex queries.
  • ORC (Optimized Row Columnar): Ideal for write-heavy workloads and transactional processing due to its high compression and fast read/write performance.
  • AVRO: A row-based format that supports both structured and unstructured data, making it suitable for schema evolution.
  • TEXTFILE: A format for plain text, often used for CSV or TSV files. It is the default format if none is specified.
  • JSONFILE: Used for storing data in JSON format.

Beyond these five commonly used formats, Hive also supports others, such as SEQUENCEFILE (a flat file of binary key/value pairs) and RCFILE (Record Columnar File), as well as custom formats via the INPUTFORMAT and OUTPUTFORMAT clauses and compressed variants for which you can specify a compression codec, as sketched below.
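
For instance, compression for ORC is typically configured through table properties. Here is a minimal sketch of the sales table stored as ORC with Snappy compression:

-- Create the sales table in ORC format with Snappy compression
CREATE TABLE IF NOT EXISTS ecom_sales_orc (
    SKU_id STRING,
    unit_sold INT,
    price_USD DOUBLE
)
PARTITIONED BY (ymd INT)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");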

When creating a new table in Hive, it's essential to choose the appropriate format(s) to ensure efficient data ingestion, storage, querying, and processing.


DISTRIBUTE BY / CLUSTER BY

Hive uses the DISTRIBUTE BY clause to distribute rows across reducers.

SET mapreduce.job.reduces=2;
SELECT * FROM ecom_sales
WHERE ymd=20240801
DISTRIBUTE BY SKU_id;

In the example above, I set the number of reducers to 2. With the DISTRIBUTE BY clause, all rows sharing the same SKU_id are processed by the same reducer: every row with SKU_id 'S0001D' goes to one reducer, every 'S0002D' row to one, and every 'S0003D' row to one. Whether any two of these SKU_ids land on the same reducer or different ones depends on the hashing function Hive employs.

You can also add a SORT BY clause to the syntax, which sorts the data within each reducer.

SET mapreduce.job.reduces=2;
SELECT * FROM ecom_sales 
WHERE ymd = 20240801 
DISTRIBUTE BY SKU_id
SORT BY SKU_id;

Within each reducer, the returned rows will be sorted by SKU_id.
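
Note how SORT BY differs from the standard ORDER BY: SORT BY orders rows only within each reducer, while ORDER BY produces a single, globally sorted result by funneling all rows through one reducer, which can be slow on large datasets:

SELECT * FROM ecom_sales
WHERE ymd = 20240801
ORDER BY SKU_id;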

Alternatively, you can use CLUSTER BY to combine the functionality of both DISTRIBUTE BY and SORT BY, producing the same output.

SET mapreduce.job.reduces=2;
SELECT * FROM ecom_sales 
WHERE ymd = 20240801 
CLUSTER BY SKU_id;
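
One caveat: CLUSTER BY sorts only in ascending order. To sort within each reducer in descending order, fall back to the explicit DISTRIBUTE BY / SORT BY pair:

SET mapreduce.job.reduces=2;
SELECT * FROM ecom_sales
WHERE ymd = 20240801
DISTRIBUTE BY SKU_id
SORT BY SKU_id DESC;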

LATERAL VIEW with EXPLODE

The LATERAL VIEW combined with the EXPLODE function is a powerful feature in Hive for handling array or map data types. It allows you to flatten nested data structures into a more traditional relational format, making it easier to analyze complex data. To demonstrate this feature, I created another table named products with the following columns: SKU_id—the unique identifier for each SKU, SKU_name—the product name associated with the SKU, and description—an array containing descriptions of the product for that SKU.

CREATE TABLE products (
   SKU_id STRING,
   SKU_name STRING,
   description ARRAY<STRING>
);

INSERT INTO products VALUES
   ('S0001D', 'white_shirt_small', ARRAY('shirt','white','small')),
   ('S0002D', 'blue_pants_large', ARRAY('pants','blue','large')),
   ('S0003D', 'yellow_jacket_large', ARRAY('jacket','yellow','large'));

SELECT * FROM products;

The syntax above creates a table that looks like:

SKU_id    SKU_name              description
S0001D    white_shirt_small     ["shirt","white","small"]
S0002D    blue_pants_large      ["pants","blue","large"]
S0003D    yellow_jacket_large   ["jacket","yellow","large"]

Next, I use the EXPLODE function to break the description array into individual rows, so that each element of the array becomes a separate row. LATERAL VIEW then joins each generated row back to its originating row in products, exposing the exploded values through the virtual table alias featureTable.

SELECT p.SKU_id, p.SKU_name, feature
FROM products p
LATERAL VIEW EXPLODE(p.description) featureTable AS feature;

The querying result will look like:

SKU_id    SKU_name              feature
S0001D    white_shirt_small     shirt
S0001D    white_shirt_small     white
S0001D    white_shirt_small     small
S0002D    blue_pants_large      pants
S0002D    blue_pants_large      blue
S0002D    blue_pants_large      large
S0003D    yellow_jacket_large   jacket
S0003D    yellow_jacket_large   yellow
S0003D    yellow_jacket_large   large
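
EXPLODE also handles MAP columns, emitting one row per key/value pair. Here is a minimal sketch, using a hypothetical prod_attrs table introduced for illustration:

-- Hypothetical table with a MAP column
CREATE TABLE prod_attrs (
   SKU_id STRING,
   attrs MAP<STRING, STRING>
);

-- Each key/value pair in the map becomes its own row
SELECT p.SKU_id, attr_key, attr_value
FROM prod_attrs p
LATERAL VIEW EXPLODE(p.attrs) attrTable AS attr_key, attr_value;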

COLLECT_SET

The COLLECT_SET function is an aggregation function in Hive that collects unique elements into an array, effectively performing the reverse operation of LATERAL VIEW combined with EXPLODE. The syntax below creates a table named prod_features, whose feature values will then be grouped into arrays.

CREATE TABLE prod_features(
    SKU_id STRING,
    SKU_name STRING,
    feature STRING
);

INSERT INTO prod_features VALUES
    ('S0001D', 'white_shirt_small','shirt'),
    ('S0001D', 'white_shirt_small','white'),
    ('S0001D', 'white_shirt_small','small'),
    ('S0002D', 'blue_pants_large','pants'),
    ('S0002D', 'blue_pants_large','blue'),
    ('S0002D', 'blue_pants_large','large'),
    ('S0003D', 'yellow_jacket_large','jacket'),
    ('S0003D', 'yellow_jacket_large','yellow'),
    ('S0003D', 'yellow_jacket_large','large');

SELECT * FROM prod_features;
The resulting prod_features table contains one row per SKU feature, exactly as inserted above.

The following syntax demonstrates how the COLLECT_SET function aggregates all unique values from the feature column into an array.

SELECT SKU_id, SKU_name, COLLECT_SET(feature) AS description
FROM prod_features
GROUP BY SKU_id, SKU_name;

The querying result will appear as follows (note that element order within a COLLECT_SET array is not guaranteed):

SKU_id    SKU_name              description
S0001D    white_shirt_small     ["shirt","white","small"]
S0002D    blue_pants_large      ["pants","blue","large"]
S0003D    yellow_jacket_large   ["jacket","yellow","large"]
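
A closely related aggregation is COLLECT_LIST, which collects values into an array without removing duplicates, useful when the number of occurrences matters:

-- COLLECT_LIST keeps duplicates; COLLECT_SET removes them
SELECT SKU_id, SKU_name, COLLECT_LIST(feature) AS all_features
FROM prod_features
GROUP BY SKU_id, SKU_name;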

Conclusion

This article discussed five Hive-specific features that set Hive apart from standard SQL, making it particularly well-suited for big data processing and analytics.

  • PARTITIONED BY enables efficient management of large datasets by optimizing data storage and retrieval.
  • STORED AS improves storage efficiency and query performance through the use of specified file formats tailored to the workload.
  • DISTRIBUTE BY and CLUSTER BY give granular control over how rows are distributed across reducers during parallel processing, which is critical for query performance in distributed systems.
  • LATERAL VIEW with EXPLODE transforms array or map data types into individual rows, facilitating detailed analysis of nested data structures.
  • COLLECT_SET aggregates individual elements into arrays, providing a powerful tool for summarizing and grouping data.

By leveraging these Hive-specific capabilities, data professionals can establish more effective and robust big data solutions, tackling challenges that would be difficult or impossible with standard SQL alone.
