An Intuitive Guide to Docker for Data Science

Author: Murphy
Photo from [Unsplash](https://unsplash.com/photos/top-view-of-mature-man-with-boxes-moving-in-new-house-sitting-and-unpacking-xIjjHU8UuPY)

While working as a data scientist, it's important to write code that runs on any operating system with all the necessary dependencies, and is ready to be deployed to the cloud. Despite your efforts, your code may still fail on another machine, and you can lose a lot of time figuring out what the problem is.

What tool can we use to avoid this struggle? Docker is the solution to your problems. Using Docker, you can easily obtain a robust environment for your data science project without going crazy.

In this article, I am going to explain Docker's main concepts and most common commands, and walk through a quick example of a dockerized machine learning application. Let's get started!


Table of contents:

  • What is Docker?
  • Basic concepts of Docker
  • Virtual Machine Vs Container
  • Setting Up Docker
  • Dockerize an ML application
  • Summary of Docker commands
  • Limitations of Docker

What is Docker?

Docker is a very popular virtualization technology that allows developers to rapidly develop, run, and deploy machine learning applications in a matter of minutes.

This is possible thanks to containers: isolated environments that include all the dependencies and run the application in a fast and consistent way.

Using this platform, you can manage your infrastructure in the same way you manage your applications, and you can significantly reduce the time between writing code and deploying it.

Basic concepts of Docker

Illustration by Author

Before going further, it's important to master three concepts related to Docker:

  • A Dockerfile contains the instructions to build the Docker image, such as defining the base operating system, specifying the dependencies of our application, and so on.
  • A Docker image is what we obtain by building the Dockerfile.
  • A Docker container is obtained by running the Docker image. It's an isolated, independent environment that can run anywhere.

The figure above helps make these concepts stick: the Dockerfile is conceptually similar to a cake recipe, since it contains the instructions for our object of interest; the Docker image is the dough; and the Docker container is our desired cake.
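In practice, this workflow boils down to two commands. Here is a minimal sketch, assuming a Dockerfile exists in the current directory and using a hypothetical image name my-image:

docker build -t my-image .   # Dockerfile -> Docker image
docker run my-image          # Docker image -> running Docker container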

Virtual Machine Vs Container

Illustration by Author. Virtual Machine Vs Container

Virtual machines and containers are virtualization technologies that allow us to run multiple isolated environments on the same physical infrastructure. Both are designed to optimize resources and costs, but they have some key differences.

Within a virtual machine, each environment runs its own guest operating system. Containers, by contrast, share the host operating system, so they consume far fewer resources than virtual machines.

Since containers encapsulate only the application and its dependencies, they are highly portable, which makes the deployment process easier and faster.

Setting Up Docker

Screenshot by Author. Docker Desktop setup instructions.

Docker Desktop is the application you need to build, share and run containerized applications. It can be installed on Linux, Windows or Mac.

It's important to know that Docker runs natively only on Linux, since it relies on features of the Linux kernel to create and manage containers.

This means that on Linux Docker doesn't require any additional virtualization layer, whereas on Windows and macOS, Docker Desktop runs containers inside a lightweight Linux virtual machine. As a result, Docker on Linux can achieve higher resource utilization and lower overhead.

To push images to Docker Hub, you also need to create an account there; Docker Hub is a central repository that allows you to find and share Docker images.
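Once you have an account, pushing a local image to Docker Hub takes three commands. Here is a minimal sketch, where your-username is a placeholder for your Docker Hub account and churn-pred-image is the image we will build later in this tutorial:

docker login
docker tag churn-pred-image your-username/churn-pred-image:latest
docker push your-username/churn-pred-image:latest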

Dockerize an ML application

Once you have familiarized yourself with Docker's concepts, it's time to show an example of dockerizing a machine learning application. To follow the tutorial more easily, I recommend using Visual Studio Code as your code editor.

In this mini-project, I am going to use the Tours and Travels Churn Prediction dataset from Kaggle. The task consists of predicting whether a customer of a travel company will churn, based on several variables such as age, annual income class, services opted, and so on.

Feel free to take a look at the GitHub repository to follow the tutorial better.

All the code for training the CatBoost model and making predictions is saved in a script called train_churn_model.py.

Python">import pandas as pd

from sklearn.model_selection import train_test_split

from catboost import CatBoostClassifier, Pool

from sklearn.metrics import recall_score, precision_score

# Load churn prediction dataset
churn_df = pd.read_csv('Customertravel.csv')
X = churn_df.drop(columns=['Target'],axis=1)
y = churn_df['Target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,test_size=0.2,random_state=123)
train_data = Pool(data=X_train,label=y_train,cat_features=[1,2,4,5])
val_data = Pool(data=X_val,label=y_val,cat_features=[1,2,4,5])
test_data = Pool(data=X_test,label=y_test,cat_features=[1,2,4,5])

# Train a catboost model
model = CatBoostClassifier(n_estimators=500,
                           learning_rate=0.1,
                           depth=4,
                           loss_function='Logloss',
                           random_seed=123,
                           verbose=True)
model.fit(train_data,eval_set=val_data)

# Make predictions
y_train_pred = model.predict(train_data)
y_val_pred = model.predict(val_data)
y_test_pred = model.predict(test_data)

# Calculate precision and recall
train_precision_score = precision_score(y_train, y_train_pred)
train_recall_score = recall_score(y_train, y_train_pred)
val_precision_score = precision_score(y_val, y_val_pred)
val_recall_score = recall_score(y_val, y_val_pred)
test_precision_score = precision_score(y_test, y_test_pred)
test_recall_score = recall_score(y_test, y_test_pred)

# Print precision and recall
print(f'Train Precision: {train_precision_score}')
print(f'Val Precision: {val_precision_score}')
print(f'Test Precision: {test_precision_score}')
print(f'Train Recall: {train_recall_score}')
print(f'Val Recall: {val_recall_score}')
print(f'Test Recall: {test_recall_score}')

To dockerize our application, we will go through the following steps:

  • Create requirements.txt
  • Write Dockerfile
  • Build the Docker Image
  • Run the Docker Container

  1. Create requirements.txt

To ease dockerizing our application, we need the file requirements.txt, which lists all the Python dependencies.

It can be created automatically by installing the pigar library and then running the command pigar generate in your terminal.
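For reference, these are the two commands, run from the project folder:

pip install pigar
pigar generate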

You should obtain a file like this:

catboost==1.2.5
pandas==2.2.2
scikit-learn==1.4.2

2. Write Dockerfile

In addition to requirements.txt, we create a file named Dockerfile, which contains the instructions to build the Docker image.

FROM python:3.10

WORKDIR /src

# Copy the script, the requirements file, and the dataset, then install dependencies
COPY train_churn_model.py requirements.txt Customertravel.csv /src/
RUN pip install --no-cache-dir -r requirements.txt 

# Run the script
CMD ["python","train_churn_model.py"]

The FROM instruction specifies the base image used for the project; in this case, Python 3.10.

Then we set the working directory and copy requirements.txt, train_churn_model.py, and Customertravel.csv into it. Once requirements.txt is copied, we can install the dependencies with pip.

At the end, the CMD instruction specifies the command that runs the script when the container starts.
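As a side note, a common optimization, not used in this tutorial, is to copy requirements.txt on its own before the rest of the files, so that Docker's layer cache skips reinstalling the dependencies when only the script or the dataset changes. A sketch with the same files:

FROM python:3.10

WORKDIR /src

# Copy only the requirements first, so this layer is cached across code changes
COPY requirements.txt /src/
RUN pip install --no-cache-dir -r requirements.txt

# Copy the script and the dataset afterwards
COPY train_churn_model.py Customertravel.csv /src/

# Run the script
CMD ["python","train_churn_model.py"]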

3. Build the Docker Image

Once the files requirements.txt and Dockerfile are created, most of the effort is done. To create the Docker image, we just need to use the build command:

docker build -t churn-pred-image .

After building the Docker image, named "churn-pred-image", we list all the images to make sure ours was created successfully:

docker images

This is the list of images returned by the command:

REPOSITORY         TAG       IMAGE ID       CREATED             SIZE
churn-pred-image   latest    f2d735527110   About an hour ago   1.81GB

If you have created other images, your table will contain more rows, each one corresponding to a different image.

4. Run the Docker Container

Finally, we are ready to run the Docker container. Now that the image is built, we just need one more command:

docker run -d --name churn-pred-container churn-pred-image

With the --name flag, we specify the desired name of the Docker container, followed at the end by the name of the Docker image built previously. The -d flag runs the container in detached mode, in the background.

As before, we can display all the containers created so far, using the -a flag to include stopped ones:

docker ps -a

This is the output:

CONTAINER ID   IMAGE              COMMAND                  CREATED             STATUS                      PORTS     NAMES
7865084c8e70   churn-pred-image   "python train_churn_..."   About an hour ago   Exited (0) 17 minutes ago             churn-pred-container
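The Exited (0) status means the training script ran to completion. Since the container printed its metrics and stopped, you can still read its output at any time with docker logs, using the container name chosen above:

docker logs churn-pred-container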

That's it! We have dockerized our machine learning application!

Summary of Docker commands

  • docker build -t <image-name> . to build a Docker image
  • docker run -d --name <container-name> <image-name> to create and run a Docker container
  • docker images to display the list of created images
  • docker ps -a to show the list of containers
  • docker rmi <image-name> to remove an image
  • docker stop <container-name> to stop a running container
  • docker rm <container-name> to remove a stopped container
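For example, to clean up the container and image created in this tutorial (stopping is a no-op here, since the container has already exited):

docker stop churn-pred-container
docker rm churn-pred-container
docker rmi churn-pred-image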

Limitations of Docker

Like other virtualization technologies, Docker has some limitations. These are the main disadvantages:

  • It takes time to understand how to write Dockerfiles, build images, and manage containers if you are new to these concepts.
  • Even if containers are lightweight and require fewer resources than VMs, sharing the host operating system's kernel can introduce security issues.
  • Docker struggles with use cases that require a graphical user interface, since it was initially designed for server-side applications that don't need a GUI.

Final thoughts

This was an introductory guide to help you get started with Docker.

Docker can be a powerful tool for data science projects. To decide whether Docker is the right choice for your specific use case, it's crucial to weigh its strengths and weaknesses.

If you want to go deeper into the topic, take a look at the resources listed at the end of the article.

I hope you have found the article useful. Have a nice day!


Disclaimer: This data set is licensed under CC0 1.0 Universal (CC0)


Useful resources:
