Version Controlling in Practice: Data, ML Model, and Code

Version control is a crucial practice! Without it, your project may become disorganized, making it challenging to roll back to any desired point. You risk losing critical model configurations, weights, experiment results from extensive training periods, and even the entire project itself. You might also find yourself in disagreements and conflicts with your teammates when the code breaks, hindering effective collaboration. In this article, we explore the importance of version control through a practical example that employs some of the most common tools in the field. The entire codebase for this article is accessible in the associated repository.
Table of contents:
1. Introduction
2. Tools
3. Setting up your project
∘ 3.1. Project folder
∘ 3.2. Project environment
4. Code versioning
5. Data versioning
6. Model versioning
Conclusion
1. Introduction

Version control is the practice of recording changes to a file or set of files over time, using version control systems, so that specific versions can be recalled later. In MLOps, version control is one of the main principles, and the first one I recommend considering when starting a machine learning project. To harness all of its benefits, version control should be applied across the different machine learning workflow steps, including data, the machine learning model (ML model), and code.
Why versioning? Using version control for code, data, and models enables: reproducibility (another important MLOps principle), by allowing us to recreate specific states of the project at any given point in time; tracking and monitoring of changes, by establishing a systematic approach to capturing, documenting, and managing changes throughout the development lifecycle; collaboration, by tracking changes made by different contributors and merging those changes efficiently; and many other important benefits, such as error recovery and traceability.
A versioning use case? Let's consider some specific scenarios in the handwritten digits classification project that we will use as an example throughout this article.
- Code. Suppose we introduced optimizations to improve speed. However, after deployment, users reported unexpected inaccuracies in predictions. Thanks to the project's robust code versioning practices, we can promptly identify the commit associated with the bug and temporarily roll the deployment back to the version before the optimization was integrated, while we address the bug, fix it, and reintegrate it into the main project version.
- Data. Suppose we decide to augment the dataset to enhance the model's generalization capabilities. However, after the augmented dataset is used in training, unexpected variations in model performance are observed. Therefore, we review the versioning history, identify the specific augmentation technique that may be causing issues, and swiftly roll back to the previous version of the dataset. Then, we collaboratively work on refining the data augmentation approach, ensuring that only validated changes are reintegrated into the main project version.
- ML model. Suppose now that we embark on refining the model architecture to boost accuracy. We implement a Convolutional Neural Network (CNN) for improved feature extraction and integrate it into the main project. However, during deployment, subtle discrepancies arise, impacting real-time predictions. Therefore, we roll back to the previous, more stable model version. Then, we collaboratively address the issues, conduct thorough testing, and integrate the refined model back into the main project version.
Although this is an article dedicated to how to use version control in your project, it's also part of my MLOps articles series. Furthermore, by following my previous and next tutorials you'll be able to create your own end-to-end MLOps project starting from the workflow to model deployment and tracking.
If you are interested in MLOps, check out my articles:
- Tutorial 1: A Key Start to MLOps: Exploring Its Essential Components
- Tutorial 2: A Beginner-Friendly Introduction to MLOps Workflow
- Tutorial 3: Introduction to MLOps Principles
- Tutorial 4: Structuring Your Machine Learning Project with MLOps in Mind
- Tutorial 5: Version Controlling in Practice: Data, ML Model, and Code
- Tutorial 6: Testing in Practice: Code, Data and ML Model
- Tutorial 7: Tracking in Practice: Code, Data and ML Model
2. Tools

When working on machine learning projects, or any computer science project, the adequate tools need to be selected before programming begins. Tool selection depends on different factors such as project requirements, team expertise, data volume, and cost.
In this article, the following tools are selected:
- Python as the programming language: its combination of a rich ecosystem, community support, ease of learning, versatility, integration capabilities, extensive libraries, data science tools, scalability, and industry adoption collectively contributes to its prominence in the realm of machine learning projects.
- Git for code versioning. Git is an open-source distributed version control system (DVCS) widely used in software development for tracking changes in source code during the development of a project. It is an essential tool that enables teams to manage code changes effectively, collaborate seamlessly, and maintain a reliable version history. It has become a standard in the industry and is used by developers worldwide for projects of all sizes.
- DVC for data versioning. DVC, which stands for Data Version Control, is an open-source version control system widely used for data management. It is designed to manage large datasets, make projects reproducible, and improve collaboration. It works on top of Git repositories with a similar feel and flow. One of the key features of DVC is data versioning: it allows you to version-control datasets separately from code. Therefore, data can be tracked, shared, and easily switched between different versions.
- MLflow for model versioning. It is an open-source platform designed to manage the end-to-end Machine Learning lifecycle and foster collaboration among ML practitioners. Its compatibility with popular libraries, and strong community support make it an attractive choice for managing the complete machine learning lifecycle in a unified and scalable manner.
3. Setting up your project
Before getting started, ensure you have Git and DVC installed on your system. If they are not already installed, you can download and install them from the official Git website and the official DVC website, respectively. Or, if you are on Ubuntu, you can simply execute the following command lines:
sudo apt install git-all # to install git
pip install dvc # to install DVC (do not install it for now!)
However, it's strongly recommended to create a virtual environment before installing DVC; thus, we will install it in a few minutes, after creating our virtual environment. In addition, note that:
DVC does not replace or include Git. You must have git in your system to enable important features such as data versioning and quick experimentation (recommended). [1]
3.1. Project folder
Let's start off by setting up the project folder! To do so, there are several approaches including:
- Creating your folder from scratch: it is the most straightforward method, but it requires manually adding standard files and structuring the project afterwards. I do not recommend this approach when working on medium to large projects.
- Importing an existing template: it is typically the optimal choice for simple maintenance, easy collaboration and good transparency, reproducibility and reusability. In this article, we will use the following project structure for machine learning projects created using this Github template or this Cookiecutter MLOps repository, but feel free to explore alternative templates. If you're eager to delve deeper into structuring ML projects, I invite you to read my dedicated article on the topic: Structuring Your Machine Learning Project with MLOps in Mind.

- Clone/fork an existing project: it is typically the optimal choice when working on existing projects. It supports collaboration and code reuse. For this article, feel free to clone or fork my repository to easily reuse the provided code. To clone the project use:
# Clone repository:
git clone git@github.com:Chim-SO/hand-written-digits-classification.git
Using a GitHub template or cloning a GitHub repository requires some familiarity with GitHub. However, rest assured! You can still follow this tutorial, as I provide you with the necessary commands and explanations.
3.2. Project environment
Another essential step to execute is setting up the virtual environment which is a best practice in software development that enhances project isolation, dependency management, reproducibility, collaboration, and overall project cleanliness.
- Let's start by creating a virtual environment named handwritten-digits-classification-env and activating it:
python -m venv venv/handwritten-digits-classification-env
source venv/handwritten-digits-classification-env/bin/activate
- After that, and in most cases when working with a GPU, we need to update the environment with the appropriate CUDA version (see this article for more details). Nevertheless, to keep this tutorial simple and accessible, a GPU is not required, especially because the project requirements are modest: neither the data nor the model is large.
- Finally, we install requirements and DVC by executing the following command:
pip install -r requirements.txt
pip install dvc
4. Code versioning
After setting up the repository, we are now ready to start versioning! In this tutorial, we adopt a straightforward feature branch workflow. This workflow involves creating a dedicated branch for each new feature rather than making direct changes to the main branch. Then, we use the rebase/merge approach to seamlessly integrate the feature branch into the main branch.
- We start by listing all the branches in the repository and checking the current branch we are on, which is typically marked with an asterisk (*):
git branch # List local branches
* master
git branch -r # List remote branches
remotes/origin/HEAD -> origin/master
origin/master
git branch -a # List all local and remote branches
* master
remotes/origin/HEAD -> origin/master
remotes/origin/master
Here, I have only one branch, the master branch, which is also the current branch.
- If you are not already on the main branch, switch to it using:
git checkout master # switch to the main branch
git pull origin master # mandatory when working in collaboration, but you can skip it for now
- We first create a branch called feature/data where we add all the code related to data processing:
git branch feature/data # to create a branch
git checkout feature/data # to switch to the created branch
# or use the combined creation and switch command
git checkout -b feature/data
- After adding all the necessary code, we import the code into the main branch by using the merge command, which incorporates the changes present in the named branch into the current working branch:
git checkout master # switch to the main branch
git merge feature/data # apply changes to master
- Similarly, we create another branch called feature/model where we add all the code involving model creation, training, and validation, and merge it into the main branch:
# Model branch creation:
git checkout master # switch to the main branch
git checkout -b feature/model
# Development ...
# Merge branch:
git checkout master # switch to the main branch
git merge feature/model # apply changes to master
At this point, we can say that we have created a simple first version of our code! It's time to mark this specific point by adding a tag as follows:
git tag -a v1.0 -m "Version 1.0"
The entire workflow is described as follows:

Where each circle represents a commit that can be displayed using the command:
git log --pretty=format:"%h - %an, %ar : %s"

Back to our code problem example:
- Let's say that after the deployment a problem arises, so we decide to temporarily roll the deployed code back to the previous version:
git revert -m 1 <merge-commit> # Revert the merge commit (-m 1 keeps the main branch side)
The revert operation undoes the modifications introduced by the specified commit by creating a new commit. Still, we might need to resolve any conflicts that arise during the process, similar to what happens during a regular merge.
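To make the rollback concrete, here is a minimal, self-contained sketch that reproduces the scenario in a throwaway repository using Python's subprocess module (the file name, values, and commit messages are hypothetical, not taken from the actual project):

```python
import pathlib
import subprocess
import tempfile

def run(*args, cwd):
    """Run a git command in the given repository and return its stdout."""
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

with tempfile.TemporaryDirectory() as repo:
    run("git", "init", "-q", cwd=repo)
    run("git", "config", "user.email", "demo@example.com", cwd=repo)
    run("git", "config", "user.name", "Demo", cwd=repo)

    f = pathlib.Path(repo, "inference.py")   # hypothetical file
    f.write_text("THRESHOLD = 0.5\n")
    run("git", "add", ".", cwd=repo)
    run("git", "commit", "-qm", "stable version", cwd=repo)

    f.write_text("THRESHOLD = 0.9\n")        # the 'optimization' that broke predictions
    run("git", "add", ".", cwd=repo)
    run("git", "commit", "-qm", "speed optimization", cwd=repo)

    # Roll back: revert creates a NEW commit that undoes the last change
    run("git", "revert", "--no-edit", "HEAD", cwd=repo)

    restored = f.read_text()                                          # back to 0.5
    n_commits = run("git", "log", "--oneline", cwd=repo).count("\n")  # 3 commits
```

Note how the history ends up with three commits: the revert does not delete the faulty one, which keeps the project history auditable.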
- By using the commit history, we identify that a specific optimization in the model branch might be causing the issue. Thus, a hotfix branch named hotfix/inference-bug is created to address the bug:
git checkout -b hotfix/inference-bug
- We then make the necessary corrections to the code and commit the changes:
git commit -m "Fix bug in digit classification during inference"
- The bug fix is tested thoroughly, a new pull request is opened for code review if we are working in a team, and finally the hotfix is merged into the main branch:
git checkout master
git merge hotfix/inference-bug
The corrected code, free of the bug, is re-deployed to the production environment.
5. Data versioning
Now that the code is ready, we can download the dataset, set its first version, and then transform it into CSV format.
- First, we need to make sure that the folder where we will store the data isn't ignored by Git. This is done by checking the .gitignore file and removing/commenting the line that excludes the data folder from source control. If you're using the template I provided, comment line 79.
- Now, we start by creating a branch feature/data-dvc, initializing the DVC project inside the project folder, and finally adding the created files to Git:
# Branch creation:
git checkout master # switch to the main branch
git pull origin master # mandatory when working in collaboration, but you can skip it for now
git checkout -b feature/data-dvc
# DVC initialisation:
dvc init
# Add to Git the created files:
git commit -m "chore: Initialize DVC."
- Then, we download our dataset, add it to DVC, and add the new DVC files to Git:
# Download data
python src/data/ingestion.py -r data/raw
# Add data to dvc
dvc add data/raw/test_images.gz data/raw/test_labels.gz data/raw/train_images.gz data/raw/train_labels.gz
#Add dvc files to git and commit
git add data/raw/.gitignore data/raw/test_images.gz.dvc data/raw/test_labels.gz.dvc data/raw/train_images.gz.dvc data/raw/train_labels.gz.dvc
git commit -m "Add raw data"
Adding files to DVC generates metadata that is stored in new files with a special .dvc file extension. Also, note that even though the data folder is tracked by Git, it will be ignored once we add it to DVC, since the latter will create a .gitignore and add the data path to it.
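To illustrate what such a metafile contains, here is a hypothetical .dvc file for one of the raw archives, with a minimal stdlib-only extraction of its fields (the hash and size values are made up for illustration; in practice the file is YAML and should be read with a YAML parser):

```python
# Hypothetical contents of data/raw/train_images.gz.dvc (values are illustrative).
METAFILE = """\
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 9912422
  path: train_images.gz
"""

# Minimal extraction of the flat 'key: value' lines (illustration only).
fields = {}
for line in METAFILE.splitlines():
    line = line.strip().lstrip("- ")
    if ": " in line:
        key, value = line.split(": ", 1)
        fields[key] = value

# DVC uses the hash to detect changes and to retrieve the matching data version
# from its cache when you switch between versions.
print(fields["path"])
```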
- We also transform it into CSV format and add the resulting files data/processed/train.csv and data/processed/test.csv to DVC, in the same way as previously:
# transform data:
python src/data/build_features.py -r data/raw/ -p data/processed/
# Add to dvc:
dvc add data/processed/train.csv data/processed/test.csv
#Add dvc files to git and commit
git add data/processed/.gitignore data/processed/test.csv.dvc data/processed/train.csv.dvc
git commit -m "Add processed data"
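The article does not show the exact schema produced by build_features.py; a common convention for MNIST-style digit data, assumed in the sketch below, is one row per image with a label column followed by 784 flattened pixel values (28 × 28):

```python
import csv
import io

# Hypothetical schema: a 'label' column plus 784 pixel columns (28x28 flattened).
header = ["label"] + [f"pixel{i}" for i in range(784)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerow([5] + [0] * 784)  # one all-black image labelled '5'

buf.seek(0)
rows = list(csv.reader(buf))
n_columns = len(rows[0])   # 785 columns in total
first_label = rows[1][0]   # '5'
```

Storing the images as flat CSV rows keeps the pipeline simple for a small dataset like this, at the cost of file size compared to the original binary archives.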
At this point, the data is downloaded and created. The next step is to merge into the main branch and add a Git tag:
# Apply changes:
git checkout master # switch to the main branch
git pull origin master # mandatory when working in collaboration, but you can skip it for now
git merge feature/data-dvc # apply changes to master
# Tag this point:
git tag -a v1.1 -m "Data collected and processed"
Back to our data problem example:
- Let's say that we applied offline augmentation to the processed data and added it to DVC:
# Add to dvc after update:
dvc add data/processed/train.csv data/processed/test.csv
git add data/processed/test.csv.dvc data/processed/train.csv.dvc
git commit -m "Data augmentation offline"
- However, after training, the model behaved poorly, so we decide to go back to the previous version of the data as follows:
git checkout HEAD~1 data/processed/test.csv.dvc data/processed/train.csv.dvc # restore the previous metafiles
dvc checkout data/processed/test.csv.dvc data/processed/train.csv.dvc # sync the data files to match them
6. Model versioning
As mentioned previously, we use MLflow to track and manage our model. Since for the moment we will work locally, we start a local MLflow Tracking Server:
mlflow server --host 127.0.0.1 --port 8080
- Create a branch where we train and save our model:
# Branch creation:
git checkout master # switch to the main branch
git checkout -b feature/model-mlflow
- Now, we initiate an MLflow run context to start a run, train the model, and then save it using MLflow:
# Create model:
model = create_model(x_train[0].shape)
# Log parameters:
loss = 'categorical_crossentropy'
metric = 'accuracy'
# Train:
model.compile(loss=loss, optimizer='adam', metrics=[metric])
history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1,
                    validation_data=(x_val, y_val))
# ....
# Set tracking server uri for logging:
mlflow.set_tracking_uri(config['mlflow']['tracking_uri'])
# Create an MLflow experiment:
mlflow.set_experiment(config['mlflow']['experiment_name'])
# Start an MLflow run:
with mlflow.start_run():
    # Save model:
    signature = infer_signature(x_train, y_train)
    mlflow.tensorflow.log_model(model, output_path, signature=signature)
    # Log other metrics and parameters:
    # Next tutorial.
- Merge into the main branch and add a Git tag:
# Apply changes:
git checkout master # switch to the main branch
git merge feature/model-mlflow # apply changes to master
# Tag this point:
git tag -a v1.2 -m "Model versioning mlflow"
- Train a model using the following command:
python -m src.models.cnn.train -c configs/cnn.yaml
Where the configs/cnn.yaml file contains some configuration parameters, like the batch size and the number of epochs.
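The contents of configs/cnn.yaml are not listed in this article; based on the parameters used above, a hypothetical sketch might look like the following (all field names here are assumptions, not the actual file):

```yaml
# Hypothetical configs/cnn.yaml (field names are illustrative)
data:
  path: data/processed
training:
  batch_size: 128
  epochs: 10
mlflow:
  tracking_uri: http://127.0.0.1:8080
  experiment_name: cnn
```

Keeping such parameters in a versioned config file, rather than hard-coding them, means every tracked run can be traced back to the exact settings that produced it.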
- We can view the run in the MLflow UI and see the results by simply navigating to the previous URL in our browser. Click on the experiment name cnn to list its associated runs, and then click on the random name that has been generated for the run:

- By clicking on the run name, the RUN page is displayed where the details of the execution are shown:

When you save a model using MLflow, it creates a directory structure containing the following:
- a data folder that includes the serialized files containing the model parameters.
- an MLmodel file that includes metadata about the model, such as the framework, the model's signature, and other properties.
- conda.yaml, python_env.yaml, and requirements.txt files that help recreate the same environment when loading the model.
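For illustration, a hypothetical MLmodel file for this kind of TensorFlow/Keras model might look like the following (the exact fields depend on the MLflow version; everything below is an assumed sketch, not output from the actual project):

```yaml
# Hypothetical MLmodel file (fields vary with the MLflow version)
artifact_path: model
flavors:
  python_function:
    loader_module: mlflow.tensorflow
    python_version: 3.10.12
  tensorflow:
    model_type: keras
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 28, 28]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 10]}}]'
run_id: <the MLflow run id that produced the model>
```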
It also provides insights into the model schema and demonstrates how to execute predictions, offering flexibility with both Spark DataFrames and Pandas DataFrames. Another remarkable aspect of MLflow is its ability to preserve the commit ID from which the model was generated. Furthermore, it introduces a straightforward model registration option, which is a topic that will be explored in upcoming articles.
Conclusion
Here we come to the end of this article, in which we learned, through a practical example, how to implement version control for the three elements of a machine learning project: the code, the data, and the machine learning model. Version control is a fundamental principle in MLOps that enables meticulous tracking, seamless collaboration, and robust reproducibility in machine learning workflows. The entire codebase for this article is accessible in the associated repository.
Thanks for reading this article. You can find all the examples of the different tutorials I provide in my GitHub profile. If you appreciate my tutorials, please support me by following me and subscribing. This way, you'll receive notifications about my new articles. If you have any questions or suggestions, feel free to leave a comment.
References
[1] https://dvc.org/doc/install
Image credits
All images and figures in this article whose source is not mentioned in the caption are by the author.