Seamless Data Analytics Workflow: From Dockerized JupyterLab and MinIO to Insights with Spark SQL

This tutorial guides you through an analytics use case: analyzing semi-structured data with Spark SQL. We'll start with the data engineering process, pulling data from an API and finally loading the transformed data into a data lake (represented by MinIO). Along the way, we'll utilize Docker to introduce a best practice for setting up the environment. So, let's dive in and see how it's all done!
Table of contents
- Understanding the building blocks
- Setting up Docker Desktop
- Configuring MinIO
- Getting started with JupyterLab
- Data pipeline: The ETL process
- Analysing semi-structured data
- Cleanup of resources
Understanding the building blocks
This tutorial involves a range of technologies. Before diving into the practical part, let's grasp each one. We'll use analogies to make understanding each component easier.
Imagine you're a captain setting sail across a vast ocean. In the world of data, this ocean is the endless stream of information flowing from various sources. Our ship? It's the suite of tools and technologies we use to navigate these waters.
- JupyterLab and MinIO with Docker Compose: Just as a ship needs the right parts to set sail, our data journey begins with assembling our tools. Think of Docker Compose as our toolbox, letting us efficiently put together JupyterLab (our navigation chart) and MinIO (our storage deck). It's like building a custom vessel that's perfectly suited for the voyage ahead.
- Fetching data with Python: Now, it's time to chart our course. Using Python is like casting a wide net into the sea to gather fish (our data). We carefully select our catch, pulling data through the API and storing it in JSON format – a way of organizing our fish so that it's easy to access and use later.
- Reading and transforming data with PySpark: With our catch on board, we use PySpark, our compass, to navigate through this sea of data. PySpark helps us clean, organize, and make sense of our catch, transforming raw data into valuable insights, much like how a skilled chef would prepare a variety of dishes from the day's catch.
- Analytics with Spark SQL: Finally, we dive deeper, exploring the depths of the ocean with Spark SQL. It's like using a sophisticated sonar to find hidden treasures beneath the waves. We perform analytics to uncover insights and answers to questions, revealing the valuable pearls hidden within our sea of data.
Now that we know what lies ahead in our journey, let's begin setting things up.
Setting up Docker Desktop
Docker is a tool that makes it easier to create, deploy, and run applications. Docker containers bundle up an application with everything it needs (like libraries and other dependencies) and ship it as one package. This means that the application will run the same way, no matter where the Docker container is deployed – whether it's on your laptop, a colleague's machine, or a cloud server. This solves a big problem: the issue of software running differently on different machines due to varying configurations.
In this guide, we're going to work with several Docker containers simultaneously. It's a typical scenario in real-world applications, like a web app communicating with a database. Docker Compose facilitates this. It allows us to start multiple containers, with each container handling a part of the application. Docker Compose ensures these components can interact with each other, enabling the application to function as an integrated unit.
To set up Docker, we use the Docker Desktop application. Docker Desktop is free for personal and educational use. You can download it from here.

After installing Docker Desktop, we'll begin with the tutorial. We'll start a new project in an Integrated Development Environment (IDE). You can choose any IDE you prefer. I'm using Visual Studio Code.
For this guide, I'm using a Windows machine with WSL 2 (Windows Subsystem for Linux) installed. This setup lets me run a Linux environment, specifically Ubuntu, on my Windows PC. If you're using Windows too and want to enable Docker Desktop for WSL 2, there's a helpful video you can watch.
Next, we'll create a `docker-compose.yml` file in the root directory of our project.
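Here's a minimal sketch of what that file can look like, assembled from the settings we'll walk through in the next sections. The image names, ports, volume path, credentials, and the MinIO startup command are the ones used throughout this tutorial; the `MINIO_ROOT_USER` / `MINIO_ROOT_PASSWORD` variable names are MinIO's standard credential variables and are my assumption about how the environment section is filled in, so adjust if your compose file differs.

```yaml
services:
  minio:
    image: minio/minio
    container_name: minio1
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    volumes:
      - /mnt/data:/data
    environment:
      MINIO_ROOT_USER: minio        # assumed variable names; values match this tutorial
      MINIO_ROOT_PASSWORD: minio123
    command: server /data --console-address ":9001"

  jupyter:
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"   # JupyterLab web interface
```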
If this is your first time encountering a file like this, don't fret. I'll go into more detail about it in the following sections. For now, just run the command `docker-compose up -d` in the directory where this file is located. This command will initially fetch the Docker images for JupyterLab and MinIO from Docker Hub.

A Docker image is like a blueprint or a recipe for creating a Docker container. Think of it as a pre-packaged box that contains everything you need to run a specific software or application. This box (or image) includes the code, runtime, system tools, libraries, and settings – basically all the necessary parts that are required to run the application.
Containers are simply running instances of Docker images.
Docker Hub is like an online library or store where people can find and share Docker images.

After the images are downloaded, Docker Compose will launch a container for each image. This process will initiate two containers – one for JupyterLab and another for MinIO.
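If you want to confirm from the terminal that both containers are running (an optional check), Docker Compose can list them:

```bash
# List the containers managed by this docker-compose.yml and their status
docker-compose ps
```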

With the required processes now operational, let's dive deeper into MinIO and how it's configured.
Configuring MinIO
MinIO is an open-source object storage solution, specifically designed to handle large volumes and varieties of data. It's highly compatible with Amazon S3 APIs, which makes it a versatile choice for cloud-native applications.
MinIO is much like using a 'free' version of Amazon S3 on your PC.
We'll utilize MinIO for storing both raw and processed data, mimicking a real-world scenario. Thanks to Docker, we already have MinIO up and running. Next, we need to learn how to use it. But first, let's revisit the `docker-compose.yml` file.
The `services` section in the file outlines the containers we'll run and the software instances they will initiate. Our focus here is on the MinIO service.

Let's break this down.
- `image: minio/minio` tells Docker to use the MinIO image from Docker Hub (the online library of Docker images).
- `container_name: minio1` gives a name to this container, in this case `minio1`.
- `ports: - "9000:9000" - "9001:9001"` maps the ports from the container to your host machine. This allows you to access the MinIO service using these ports on your local machine.
- `volumes: - /mnt/data:/data` sets up a volume, which is like a storage space, mapping a directory on your host machine (`/mnt/data`) to a directory in the container (`/data`). This means MinIO will use the `/mnt/data` directory on your machine to store the data.
- The `environment:` section sets environment variables inside the container. Here, it sets the MinIO root user's username and password.
- `command: server /data --console-address ":9001"` is the command that will be run inside the MinIO container. It starts the MinIO server and tells it to use the `/data` directory.
With MinIO's setup clear, let's begin using it. You can access the MinIO web interface at [http://localhost:9001](http://localhost:9001/). On your initial visit, you'll need to log in with the username (`minio`) and password (`minio123`) specified in the docker-compose file.

Once logged in, go ahead and create a bucket. Click on 'Create a Bucket' and name it `mybucket`. After naming it, click on 'Create Bucket'. The default settings are fine for now, but feel free to read about them on the page's right side.
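If you'd rather create the bucket from code than from the console, the `s3fs` package we install later in the tutorial can do it too. This is an optional sketch that assumes it runs inside the JupyterLab container, where the MinIO service is reachable as `minio1`:

```python
# Optional: create the bucket programmatically instead of via the MinIO console
import s3fs

fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": "http://minio1:9000"},  # MinIO container, as seen from JupyterLab
    key="minio",          # same credentials as in docker-compose.yml
    secret="minio123",
    use_ssl=False,
)

if not fs.exists("mybucket"):
    fs.mkdir("mybucket")  # creating a top-level "directory" creates the bucket
```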

Well done! We're now ready to use MinIO. Let's move on to exploring how we can use JupyterLab.
Getting started with JupyterLab
JupyterLab is an interactive, web-based interface that helps us write code, work with data, and perform analysis in notebooks. In fact, the JupyterLab image already includes Python and PySpark, so there's no hassle in setting them up.

First, let's revisit the `docker-compose.yml` file to understand the `jupyter` service.
- `image: jupyter/pyspark-notebook` specifies the JupyterLab image that comes with PySpark pre-installed.
- `ports: - "8888:8888"` maps the JupyterLab port to the same port on your host machine, allowing you to access it through your browser.
To access its web interface, navigate to the 'Containers' tab in the Docker Desktop application. Find and click on the JupyterLab container, labeled `jupyter-1`. This action will display the container logs.
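If you prefer the command line to Docker Desktop, the same logs (including the tokenized URL) can be printed with `docker logs`. The exact container name can vary with your project folder name, so check `docker ps` first:

```bash
# Find the JupyterLab container name, then print its logs to locate the URL with the token
docker ps
docker logs jupyter-1
```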

Within these logs, you'll find a URL resembling this: [http://127.0.0.1:8888/lab?token=4f1c9d4f1aeb460f1ebbf224dfa9417c88eab1691fa64b04](http://127.0.0.1:8888/lab?token=4f1c9d4f1aeb460f1ebbf224dfa9417c88eab1691fa64b04). Clicking on this URL launches the web interface.

Once there, select the 'Python 3 (ipykernel)' icon under the 'Notebook' section. This action opens a new notebook, where we'll write code for data retrieval, transformation, and analysis. Before diving into coding, remember to save and name your notebook appropriately. And there you have it, we're ready to start working with the data.
Data pipeline: The ETL process
Before diving into Data Analysis, we first need to gather the data. We'll employ an ETL (Extract, Transform, Load) process, which involves the following steps:
- Initially, we'll extract data using a public API.
- Then, we'll load this data as a JSON file into the MinIO bucket.
- After that, we'll use PySpark to transform the data and save it back to the bucket in Parquet format.
- Lastly, we'll create a Hive table from this Parquet data, which we'll use for running Spark SQL queries for analysis.
First up, we need to install the `s3fs` Python package, which is essential for working with MinIO in Python.
```python
!pip install s3fs
```
Following that, we'll import the necessary dependencies and modules.
```python
import requests
import json
import os
import s3fs
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pyspark.sql.functions as F
```
We'll also set some environment variables that will be useful when interacting with MinIO.
```python
# Define environment variables
os.environ["MINIO_KEY"] = "minio"
os.environ["MINIO_SECRET"] = "minio123"
os.environ["MINIO_ENDPOINT"] = "http://minio1:9000"
```
Next, we'll fetch data from the public API using the `requests` Python package. We're using the open-source REST Countries project, which gives information about the different countries of the world – area, population, capital city, time zones, etc. Click here to learn more about it.
```python
# Get data using REST API
def fetch_countries_data(url):
    # Using a session is particularly beneficial if you are making
    # multiple requests to the same server, as it can reuse the
    # underlying TCP connection, leading to performance improvements.
    with requests.Session() as session:
        response = session.get(url)
        response.raise_for_status()
        if response.status_code == 200:
            return response.json()
        else:
            return f"Error: {response.status_code}"

# Fetch data
countries_data = fetch_countries_data("https://restcountries.com/v3.1/all")
```
Once we have the data, we'll write it as a JSON file to the `mybucket` bucket.
```python
# Write data to MinIO as a JSON file
fs = s3fs.S3FileSystem(
    client_kwargs={'endpoint_url': os.environ["MINIO_ENDPOINT"]},  # minio1 = minio container name
    key=os.environ["MINIO_KEY"],
    secret=os.environ["MINIO_SECRET"],
    use_ssl=False  # Set to True if MinIO is set up with SSL
)

with fs.open('mybucket/country_data.json', 'w', encoding='utf-8') as f:
    json.dump(countries_data, f)
```
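As a quick sanity check, we can list the bucket with the same `fs` object to confirm the file landed where we expect:

```python
# Confirm the JSON file is now in the bucket
print(fs.ls("mybucket"))                      # should include 'mybucket/country_data.json'
print(fs.info("mybucket/country_data.json"))  # size and metadata of the uploaded object
```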
Great, we've successfully retrieved the data! Now, it's time to initialize a Spark session for running PySpark code. If you're new to Spark, understand that it's a big data processing framework that operates on distributed computing principles, breaking data into chunks for parallel processing. A Spark session is essentially the gateway to any Spark application.
```python
spark = SparkSession.builder \
    .appName("country_data_analysis") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.11.1026") \
    .config("spark.hadoop.fs.s3a.endpoint", os.environ["MINIO_ENDPOINT"]) \
    .config("spark.hadoop.fs.s3a.access.key", os.environ["MINIO_KEY"]) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["MINIO_SECRET"]) \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .enableHiveSupport() \
    .getOrCreate()
```
Let's simplify this to understand it better.
- `spark.jars.packages`: downloads the required JAR files from the Maven repository. A Maven repository is a central place used for storing build artifacts like JAR files, libraries, and other dependencies used in Maven-based projects.
- `spark.hadoop.fs.s3a.endpoint`: the endpoint URL for MinIO.
- `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key`: the access key and secret key for MinIO. Note that these are the same as the username and password used to access the MinIO web interface.
- `spark.hadoop.fs.s3a.path.style.access`: set to true to enable path-style access for the MinIO bucket.
- `spark.hadoop.fs.s3a.impl`: the implementation class for the S3A file system.
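Before moving on, it's worth a quick smoke test to confirm these settings let Spark talk to MinIO. The sketch below simply reads back the JSON file we wrote earlier through the `s3a://` connector; the `multiLine` option is needed because we stored the whole API response as a single JSON array:

```python
# Quick check: read the raw JSON from MinIO through the s3a connector
raw_df = (
    spark.read
    .option("multiLine", True)
    .json("s3a://mybucket/country_data.json")
)
print(raw_df.count())   # one row per country
raw_df.printSchema()    # inspect the inferred (nested) schema
```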
You might wonder how to choose the correct JAR version. It depends on compatibility with the PySpark and Hadoop versions we use. Here's how to check your PySpark and Hadoop versions (Hadoop is another open-source framework for working with big data).
```python
# Check PySpark version
print(pyspark.__version__)

# Check Hadoop version
sc = SparkContext.getOrCreate()
hadoop_version = sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print("Hadoop version:", hadoop_version)
```
Choosing the right JAR version is crucial to avoid errors. If you use the same Docker image, the JAR versions mentioned here should work fine. If you encounter setup issues, feel free to leave a comment; I'll do my best to assist you.