Java and Data Engineering

Data Engineering and Programming Skills
When we think about data engineering, the first programming skills that usually come to mind are SQL and maybe Python. SQL is the well-known language for querying data, deeply ingrained in the world of data and pipelines. Python, on the other hand, has become a powerhouse in data science and is now making its mark in the evolving field of data engineering. But is this common belief accurate? Are SQL and Python really the most important programming skills for data engineers? In this article, I'll share my experience on this topic, aiming to help young professionals figure out which skills are worth their time and energy.
Why Java and Scala?
In today's data engineering, we handle massive amounts of data. The main job is figuring out how to gather, transform, and store this huge load of data every day, every hour, or even in real time. What makes it trickier is making sure different data services run smoothly on various systems without having to worry about what's happening underneath.
In the last 15 years, smart folks have come up with distributed computing frameworks to deal with this data overload. Hadoop and Spark are two big names in this game. Because both these frameworks are mainly built using JVM (Java Virtual Machine) languages (Hadoop uses Java, and Spark uses Scala), many data and software experts believe that Java and Scala are the way forward in data engineering.
Moreover, the ability of JVM applications to be portable makes them an excellent choice for data applications operating across diverse systems and environments. You can develop data pipelines that seamlessly run on various cloud and local setups, allowing you to scale your systems up or down without concerns about the underlying infrastructure.
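To make this portability concrete, here is a minimal sketch (the object name and paths are assumptions, not a prescribed pattern): the cluster master is left to spark-submit and the storage locations arrive as arguments, so the same compiled jar can run on a laptop, an on-premise YARN cluster, or a cloud-managed Spark service without code changes.

import org.apache.spark.sql.SparkSession

// Hypothetical sketch: no hardcoded master and no hardcoded storage system,
// so the same jar runs locally, on YARN, on Kubernetes, or on a managed cloud service.
object PortablePipeline {
  def main(args: Array[String]): Unit = {
    val inputPath  = args(0)   // e.g. file://..., hdfs://..., s3a://..., gs://...
    val outputPath = args(1)

    val spark = SparkSession.builder
      .appName("Portable Pipeline")   // the master is supplied at submit time via --master
      .getOrCreate()

    spark.read.parquet(inputPath)
      .write.mode("overwrite")
      .parquet(outputPath)
  }
}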
What does a data pipeline look like in a JVM-based application?
Now that we've explored the benefits of Java and Scala, or more broadly, JVM-based data applications, in handling big data, the next logical question is: what do these applications, or simply data pipelines, look like? This section aims to provide an overview of the architecture of such applications.
To begin, the data pipeline itself is developed in Java or Scala. Typically, multiple related data pipelines coexist within the same Java or Scala project. For effective project management, tools like Apache Maven can be employed. Maven simplifies creating, managing, and building Java applications, making the process more efficient and reliable.
In these projects, a data pipeline often comprises one or more Java or Scala classes. Spark is commonly integrated into these classes for tasks such as reading (or extracting), transforming, and writing (or loading) data. While data can be read from and written to various sources, Hive tables are often the natural choice. Standard transformations are encapsulated in common classes, making them reusable across different pipelines (a sketch of such a shared class follows the example below).
This code shows a basic data pipeline in Spark Scala.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// A minimal batch pipeline: read a Hive table, apply transformations, and write the result back as a Hive table.
object MyExampleDataPipeline {

  def main(args: Array[String]): Unit = {
    // The input and output table names are passed as arguments by spark-submit (see below).
    val inputTableName  = args(0)
    val outputTableName = args(1)

    val spark: SparkSession = SparkSession.builder
      .appName("DataFrame Transformation Example")
      .master("local[*]") // handy for local testing; on a cluster the master is usually set at submit time
      .enableHiveSupport()
      .getOrCreate()

    val inputDataFrame: DataFrame = spark.table(inputTableName)

    // SOME_TRANSFORMATIONS is a placeholder for this pipeline's actual filters, joins, or aggregations.
    val transformedDataFrame: DataFrame = inputDataFrame.SOME_TRANSFORMATIONS

    transformedDataFrame.write.mode("overwrite").format("parquet").saveAsTable(outputTableName)
  }
}
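And here is a rough sketch of the kind of shared transformation class mentioned earlier. The object name, column names, and logic are assumptions chosen purely for illustration, not part of any specific codebase.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical shared transformations, reusable across pipelines in the same project.
object CommonTransformations {

  // Drop exact duplicate rows and trim whitespace in a given string column.
  def standardizeColumn(df: DataFrame, columnName: String): DataFrame =
    df.dropDuplicates()
      .withColumn(columnName, trim(col(columnName)))

  // Add a load timestamp so every pipeline records when a row was written.
  def withLoadTimestamp(df: DataFrame): DataFrame =
    df.withColumn("load_ts", current_timestamp())
}

Inside a pipeline, helpers like these would stand in for the SOME_TRANSFORMATIONS placeholder above, for example CommonTransformations.withLoadTimestamp(inputDataFrame).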
Ultimately, the goal is to build a Java (or Scala) application, typically packaged as a jar file. This jar file, along with the appropriate arguments, can be invoked by job and workflow management systems like Apache Airflow. This enables specific data pipelines to run at scheduled intervals, contributing to an organized and automated data processing workflow.
Here's a simple example of executing a data pipeline from the command line with spark-submit; the two trailing arguments are passed to the pipeline's main method as the input and output table names.
spark-submit --class MyExampleDataPipeline --master local[*] yourJarFile.jar your_input_hive_table_name your_output_hive_table_name
More advanced practices
As previously mentioned, data pipelines, now encapsulated in jar files, typically require scheduled execution, especially for batch processing, or activation based on triggered events, commonly for real-time processing. Apache Airflow serves as a robust solution for orchestrating these jar files and task classes, facilitating the execution of jobs on a regular schedule. Alternatively, jar files can be triggered using tools like AWS Lambda for irregular schedules and real-time processing.
This is an example of an Apache Airflow DAG designed to run the pipeline's main class, via spark-submit, once a day.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

# Location of the packaged pipeline and its Hive table arguments.
jar_file_path = "/path/to/yourJarFile.jar"
input_table_name = "your_input_hive_table_name"
output_table_name = "your_output_hive_table_name"

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'my_data_pipeline_dag',
    default_args=default_args,
    description='DAG to run the MyExampleDataPipeline class',
    schedule_interval='@daily',  # Adjust the schedule as needed
)

start_task = DummyOperator(task_id='start', dag=dag)
end_task = DummyOperator(task_id='end', dag=dag)

# The same spark-submit command shown earlier, assembled from the variables above.
spark_submit_command = f"spark-submit --class MyExampleDataPipeline --master local[*] {jar_file_path} {input_table_name} {output_table_name}"

run_data_pipeline_task = BashOperator(
    task_id='run_data_pipeline',
    bash_command=spark_submit_command,
    dag=dag,
)

start_task >> run_data_pipeline_task >> end_task
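For the event-driven, real-time side mentioned above, the same JVM-based approach carries over. Below is a minimal, hedged sketch using Spark Structured Streaming (the paths, schema, and object name are assumptions), in which new files landing in a directory are processed continuously rather than on a fixed schedule.

import org.apache.spark.sql.SparkSession

// Hypothetical streaming variant of the batch pipeline: process files as they arrive.
object MyExampleStreamingPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Streaming Pipeline Example")
      .getOrCreate()

    // Streaming sources need an explicit schema; this one is assumed for illustration.
    val incoming = spark.readStream
      .schema("id INT, event STRING, event_ts TIMESTAMP")
      .json("/data/incoming")

    // Write each micro-batch as Parquet; the checkpoint makes the query fault tolerant.
    val query = incoming.writeStream
      .format("parquet")
      .option("path", "/data/output/events")
      .option("checkpointLocation", "/data/checkpoints/events")
      .start()

    query.awaitTermination()
  }
}

Such a job is typically started once and kept running, rather than scheduled by Airflow at regular intervals.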
Moreover, Continuous Integration/Continuous Deployment (CI/CD) tools such as Jenkins, GitHub Actions, and Spinnaker offer a seamless way to build and deploy pipelines across environments, from development to testing and production. This ensures a smooth, automated transition of pipelines throughout the development lifecycle.
In the end …
We explored the evolving landscape of data engineering and the essential programming skills required in this field. While SQL and Python have traditionally been associated with data engineering, the focus is shifting towards Java and Scala, particularly in the context of handling massive amounts of data with distributed computing frameworks like Hadoop and Spark.
We also emphasized the importance of JVM (Java Virtual Machine) languages: their portability makes them an excellent fit for data applications that run seamlessly across diverse systems and environments. Finally, we walked through the architecture of JVM-based data applications, illustrating how data pipelines are developed in Java or Scala, with Apache Maven aiding project management.