An Intuitive Guide to Docker for Data Science

While working as a data scientist, it's important to write code that runs on any operating system with all the necessary dependencies, and is ready to be deployed on the cloud. Despite your efforts, it may still not work, and you may lose time trying to understand what the problem is.
What tool can we use to avoid this struggle? Docker is the solution to your problems. Using Docker, you can easily obtain a robust environment for your Data Science project, without going crazy.
In this article, I am going to explain Docker's main concepts and its most common commands, and walk through a quick example of a dockerized machine learning application. Let's get started!
Table of contents:
- What is Docker?
- Basic concepts of Docker
- Virtual Machine vs. Container
- Setting Up Docker
- Dockerize a ML application
- Summary of Docker commands
- Limitations of Docker
What is Docker?
Docker is a very popular virtualization technology that allows developers to rapidly develop, run, and deploy Machine Learning applications in a few minutes.
This is possible thanks to Containers, which are isolated environments where you can include all the dependencies and run the application in a fast and consistent way.
Using this platform, you can manage the infrastructure and your applications at the same time. In addition, you can reduce the time between writing the code and deploying it.
Basic concepts of Docker

Before going further, it's important to master three concepts related to Docker:
- Docker File contains the instructions to build the Docker Image, such as defining the operating system, specifying the dependencies of our application, and so on.
- Docker Image is what we obtain by building the Docker File: a template that packages the application together with its dependencies.
- Docker Container is obtained by running the Docker Image. It's an isolated and independent environment that can run anywhere.
If you take a look at the figure above, the concepts are easier to fix in mind: the Docker File is conceptually similar to a cake recipe, since it lists the ingredients and instructions; the Docker Image is the dough; and the Docker Container is our desired cake.
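In terms of commands, the recipe-to-cake flow looks roughly like this; the image and container names below are just placeholder examples, not part of the project:
# Build a Docker Image from the Dockerfile in the current directory
docker build -t my-image .
# Run a Docker Container from that Image
docker run --name my-container my-image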
Virtual Machine vs. Container

Virtual Machines and Containers are virtualization technologies that allow us to run multiple isolated environments within a physical infrastructure. They are both designed to optimize resources and costs, but they have some key differences.
Each Virtual Machine runs its own guest operating system. Containers, by contrast, share the host operating system, so they consume fewer resources than Virtual Machines.
Since Containers encapsulate only the application and its dependencies, they are highly portable, which makes the deployment process easier and faster.
Setting Up Docker

Docker Desktop is the application you need to build, share and run containerized applications. It can be installed on Linux, Windows or Mac.
It's important to know that Docker runs natively only on Linux, since it relies on certain features of the Linux kernel to create and manage Containers.
This means that on Linux it doesn't require any additional virtualization layer, unlike on non-Linux systems such as Windows and macOS, where Docker Desktop runs Containers inside a lightweight Linux virtual machine. As a result, Docker on Linux can achieve higher resource utilization and lower overhead.
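Once Docker Desktop is installed, you can quickly check that everything works from a terminal; a minimal sanity check:
# Print the installed Docker version
docker --version
# Pull and run a tiny test image to verify that Docker can create containers
docker run hello-world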
To create and push an image to Docker Hub, you need an account on Docker Hub, which is a central repository that allows you to find and share Docker images.
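As a rough sketch, pushing a local image to Docker Hub typically involves logging in, tagging the image with your username, and pushing it; the names below (your-username, my-image) are hypothetical placeholders:
# Log in to Docker Hub with your account credentials
docker login
# Tag the local image with your Docker Hub username and repository name
docker tag my-image your-username/my-image:latest
# Push the tagged image to Docker Hub
docker push your-username/my-image:latest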
Dockerize a ML application
Once you are familiar with Docker's concepts, it's time to show an example of dockerizing a machine learning application. To follow the tutorial more easily, I recommend using Visual Studio Code as your code editor.
In this mini-project, I am going to use the Tours and Travels Churn Prediction dataset from Kaggle. The task consists in predicting whether a customer of a travel company will churn, based on several variables such as age, annual services, and so on.
Feel free to take a look at the GitHub repository to follow the tutorial better.
All the code for training and making predictions with the CatBoost model is saved in a script called train_churn_model.py.
Python">import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import recall_score, precision_score
# Load churn prediction dataset
churn_df = pd.read_csv('Customertravel.csv')
X = churn_df.drop(columns=['Target'],axis=1)
y = churn_df['Target']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,test_size=0.2,random_state=123)
train_data = Pool(data=X_train,label=y_train,cat_features=[1,2,4,5])
val_data = Pool(data=X_val,label=y_val,cat_features=[1,2,4,5])
test_data = Pool(data=X_test,label=y_test,cat_features=[1,2,4,5])
# Train a catboost model
model = CatBoostClassifier(n_estimators=500,
                           learning_rate=0.1,
                           depth=4,
                           loss_function='Logloss',
                           random_seed=123,
                           verbose=True)
model.fit(train_data,eval_set=val_data)
# Make predictions
y_train_pred = model.predict(train_data)
y_val_pred = model.predict(val_data)
y_test_pred = model.predict(test_data)
# Calculate precision and recall
train_precision_score = precision_score(y_train, y_train_pred)
train_recall_score = recall_score(y_train, y_train_pred)
val_precision_score = precision_score(y_val, y_val_pred)
val_recall_score = recall_score(y_val, y_val_pred)
test_precision_score = precision_score(y_test, y_test_pred)
test_recall_score = recall_score(y_test, y_test_pred)
# Print precision and recall
print(f'Train Precision: {train_precision_score}')
print(f'Val Precision: {val_precision_score}')
print(f'Test Precision: {test_precision_score}')
print(f'Train Recall: {train_recall_score}')
print(f'Val Recall: {val_recall_score}')
print(f'Test Recall: {test_recall_score}')
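The script above only prints the metrics. If you also want to persist the trained model so it can be reused later (for example, by a separate prediction service), CatBoost provides save_model and load_model; a minimal optional sketch, not part of the original script:
# Optional: save the trained model to disk
model.save_model('churn_model.cbm')

# Later, in another script, the model can be restored like this:
# from catboost import CatBoostClassifier
# loaded_model = CatBoostClassifier()
# loaded_model.load_model('churn_model.cbm')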
To dockerize our application, we will follow these steps:
- Create requirements.txt
- Write Dockerfile
- Build the Docker Image
- Run the Docker Container
1. Create requirements.txt
To ease dockerizing our application, we need the file requirements.txt, which includes all the Python dependencies.
It can be created automatically by installing the library pigar and then running the command pigar generate in your terminal.
You should obtain a file like this:
catboost==1.2.5
pandas==2.2.2
scikit-learn==1.4.2
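If you prefer not to install an extra library, a quick alternative is pip freeze, although it captures every package installed in your environment rather than only the ones the script actually imports:
# Write all installed packages and their versions to requirements.txt
pip freeze > requirements.txt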
2. Write Dockerfile
In addition to requirements.txt, we create a file named Dockerfile. It contains the instructions to build the Docker Image.
FROM python:3.10
WORKDIR /src
# Copy the script, requirements file, and dataset, then install dependencies
COPY train_churn_model.py requirements.txt Customertravel.csv /src/
RUN pip install --no-cache-dir -r requirements.txt
# Run the script
CMD ["python","train_churn_model.py"]
The FROM instruction specifies the base image used for the project, in this case Python 3.10.
Then we set the working directory with WORKDIR and copy the files train_churn_model.py, requirements.txt, and Customertravel.csv into it. Once requirements.txt is copied, we can install the dependencies with pip.
At the end, the CMD instruction specifies the command that runs the script when the Container starts.
3. Build the Docker Image
Once requirements.txt and the Dockerfile are created, most of the effort is done. To create the Docker Image, we just need to use the build command:
docker build -t churn-pred-image .
After building the Docker Image, named "churn-pred-image", we list all the Images to make sure ours was created successfully:
docker images
This is the list of images returned by the command:
REPOSITORY TAG IMAGE ID CREATED SIZE
churn-pred-image latest f2d735527110 About an hour ago 1.81GB
If you have created other images, your table will contain more rows, each one corresponding to a different image.
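Note that the image is quite large (1.81 GB), mostly because of the full python:3.10 base image. If size is a concern, a common option is to start from the slim variant of the base image; a hedged sketch of the same Dockerfile, assuming all the dependencies install cleanly on it:
# Slimmer base image to reduce the final image size
FROM python:3.10-slim
WORKDIR /src
# Copy the script, requirements file, and dataset, then install dependencies
COPY train_churn_model.py requirements.txt Customertravel.csv /src/
RUN pip install --no-cache-dir -r requirements.txt
# Run the script
CMD ["python","train_churn_model.py"]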
4. Run the Docker Container
Finally, we are ready to run the Docker Container. Now that the Image is built, we just need to run it:
docker run -d --name churn-pred-container churn-pred-image
With the --name flag, we specify the desired name of the Docker Container, followed at the end by the name of the Docker Image built previously. The -d flag runs the Container in detached mode, in the background.
As before, we can display all the containers created so far using the -a flag:
docker ps -a
This is the output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7865084c8e70 churn-pred-image "python train_churn_..." About an hour ago Exited (0) 17 minutes ago churn-pred-container
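The STATUS column shows that the Container has already exited: the training script runs to completion and then stops. To see the precision and recall values it printed, you can inspect the Container's output:
# Show the output produced by the container, even after it has exited
docker logs churn-pred-container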
That's it! We have dockerized our machine learning application!
Summary of Docker commands
- docker build -t to build the Docker Image
- docker run -d --name to run the Docker Container
- docker images to display the list of created images
- docker ps -a to show the list of containers
- docker rmi to remove an image
- docker stop to stop a running container
- docker rm to remove a stopped container
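For example, to clean up everything created in this tutorial, you could combine some of these commands; since our container has already exited, the stop step is not strictly necessary:
# Stop the container (a no-op if it has already exited)
docker stop churn-pred-container
# Remove the container
docker rm churn-pred-container
# Remove the image
docker rmi churn-pred-image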
Limitations of Docker
Like other virtualization technologies, Docker has some limitations. These are the main disadvantages:
- It takes time to understand how to write Docker files, build Images, and manage Containers if you are new to these concepts.
- Even if Containers are lightweight and require fewer resources than VMs, there can be security issues due to the shared host operating system.
- Docker struggles with use cases that require a graphical user interface, since it was initially designed for server-side applications that don't need one.
Final thoughts
This was an introductory guide to help you get started with Docker.
Docker can be a powerful tool for data science projects. To decide whether Docker is the right choice for your specific use case, it's crucial to weigh its strengths and weaknesses.
If you want to go deeper into the topic, take a look at the resources listed at the end of the article.
I hope you have found the article useful. Have a nice day!
Disclaimer: This data set is licensed under CC0 1.0 Universal (CC0)
Useful resources: