Git Workflow for Machine Learning Projects: the Git Workflow I use in my Projects
Adopting a Git workflow in your projects eases project management and increases consistency. There are several Git workflows designed to meet the needs of Git users: some are straightforward, while others are more elaborate and intended for large projects. In this article, I will share the Git workflow that I use in my machine learning and data science projects. It lies somewhere in between: neither too simple nor too complex. Without delay, let's dig in!
Introduction
A Git workflow is a set of conventions and practices that standardize how a project is version-controlled, which increases consistency and facilitates collaboration. In a previous tutorial, I presented in detail the three workflows that I consider the most essential to learn: the feature branch workflow, the forking workflow and the Gitflow workflow.
In the feature branch workflow, a dedicated branch is created for each feature, bug fix or other project-oriented task. In the forking workflow, the official server-side repository is forked into a personal server-side repository; changes are pushed to the personal repository, and a pull request is opened when the official repository should be updated. In the Gitflow workflow, a set of conventions establishes how branches are organized and managed: it defines specific branch names and their roles. The first workflow is straightforward; the second is typically used when you lack permission to change the official repository directly, as in open-source projects; and the third is designed for live projects that are continuously evolving. If you are interested in more details about these workflows, I invite you to consult my tutorial: Mastering Git: The 3 Essential Workflows for Efficient Version Controlling.
My Git workflow is inspired by the Gitflow workflow and by the structure of machine learning projects. I defined a set of conventions and rules for how branches are created, what their roles are and how they are managed. I call my workflow: Machine Learning workflow!
Machine Learning workflow
As this Git workflow is inspired by the structure of machine learning projects, let's briefly review a basic project template. A machine learning project mainly includes the following folders and files:
mlops_template/
├── LICENSE
├── README.md
├── Makefile
├── configs/
│   └── model1.yaml
│
├── data/                # Data set.
│
├── docs/                # Project documentation.
│
├── models/              # Trained and serialized models.
│
├── notebooks/
│
├── references/          # Data dictionaries, manuals, ...
│
├── reports/             # Generated analysis as HTML, PDF, etc.
│   └── figures/         # Generated graphics and figures ...
│
├── requirements.txt
└── src/                 # Source code for use in this project.
    ├── __init__.py
    │
    ├── data/            # Data engineering scripts.
    │
    ├── models/          # ML model engineering.
    │   └── model1/
    │       ├── dataloader.py
    │       ├── hyperparameters_tuning.py
    │       ├── model.py
    │       ├── predict.py
    │       ├── preprocessing.py
    │       └── train.py
    │
    └── visualization/
If you are interested in more details about the machine learning project structure, I invite you to consult my tutorial: Structuring Your Machine Learning Project with MLOps in Mind. Among the different files, I'm specifically interested in those that are not ignored by Git. They can be divided into 5 groups:
- The project configuration files, such as LICENSE, README.md, Makefile and requirements.txt.
- The document files, which include notebook files, docs files, reports files and references files.
- The scripts for data processing.
- The scripts for model creation, training and validation.
- The scripts for visualization.
This grouping inspired the branches that define my Git workflow.
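To make the grouping concrete, here is a minimal sketch that maps a file path from the template to one of the five groups. The function name classify_path, the "other" fallback and the choice to count the configs/ folder as configuration are my own assumptions for illustration, not part of the template itself.

```python
from pathlib import PurePosixPath

# Top-level files treated as project configuration (taken from the template).
CONFIG_FILES = {"LICENSE", "README.md", "Makefile", "requirements.txt"}
# Top-level folders that hold documents.
DOCUMENT_DIRS = {"notebooks", "docs", "reports", "references"}
# Sub-folders of src/ and the group they belong to.
SRC_GROUPS = {"data": "data", "models": "model", "visualization": "visualization"}


def classify_path(path: str) -> str:
    """Return the group of a tracked file, or "other" if it fits none (hypothetical helper)."""
    parts = PurePosixPath(path).parts
    if not parts:
        raise ValueError("empty path")
    top = parts[0]
    if len(parts) == 1 and top in CONFIG_FILES:
        return "config"
    if top == "configs":  # assumption: YAML configs count as configuration
        return "config"
    if top in DOCUMENT_DIRS:
        return "document"
    if top == "src" and len(parts) > 2:
        return SRC_GROUPS.get(parts[1], "other")
    return "other"


print(classify_path("src/models/model1/train.py"))
print(classify_path("notebooks/exploration.ipynb"))
```

Paths such as data/ or models/ at the top level fall into "other" here because they are typically ignored by Git in this kind of template.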
Machine Learning workflow is also inspired by the Gitflow workflow: it is a feature branch workflow with some branch naming conventions and rules. I adopted a new workflow because having a single branch type named 'feature' seems more suited to software or web development. In machine learning projects, however, I've always felt the need to separate tasks not only in the project structure but also in the Git branches. Therefore, my adapted workflow includes a single principal branch (the main or master branch) and seven supporting types of branches:
- The main branch holds the project's working code; unlike the Gitflow workflow, there are no separate branches for production-ready and development code. The main branch is named main or master.
- The configuration branches cover the project configuration files, including the license file, makefiles, requirements files, etc. Their names start with the prefix config, as follows: config/.
- The document branches cover the project documents, including notebook files, docs files, reports files and references files. Their names start with the prefix document or doc, as follows: document/ or doc/.
- The data branches are created for new feature development related to data processing, including ingesting, cleaning, etc. Their names start with the prefix data, as follows: data/.
- The model branches are created for new feature development related to machine learning model management, including creation, training and validation. Their names start with the prefix model, as follows: model/.
- The visualization branches are created for new feature development related to the different kinds of visualization. Their names start with the prefix visualization or vis, as follows: visualization/ or vis/.
- The hotfix branches are created to fix issues in the project code. Their names start with the prefix hotfix, for example: hotfix/.
- The test branches are created for feature testing. They are optional and depend on the project complexity. Their names start with the prefix test, for example: test/.
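The naming conventions above can be checked mechanically. The following is a minimal sketch of a branch-name validator; the helper name branch_type, the data/ prefix for data branches and the rule that the part after the slash is lowercase letters, digits, dots, underscores or hyphens are all assumptions of mine rather than rules of the workflow.

```python
import re
from typing import Optional

MAIN_BRANCHES = {"main", "master"}
# Prefix -> canonical branch type; "doc" and "vis" are the short aliases.
PREFIXES = {
    "config": "config",
    "document": "document",
    "doc": "document",
    "data": "data",  # assumed prefix for data branches
    "model": "model",
    "visualization": "visualization",
    "vis": "visualization",
    "hotfix": "hotfix",
    "test": "test",
}
# Assumed shape of the topic after the slash, e.g. "add-xgboost".
_BRANCH_RE = re.compile(
    r"^(?P<prefix>" + "|".join(PREFIXES) + r")/(?P<topic>[a-z0-9][a-z0-9._-]*)$"
)


def branch_type(name: str) -> Optional[str]:
    """Return the branch type for a valid name, or None if it breaks the convention."""
    if name in MAIN_BRANCHES:
        return "main"
    match = _BRANCH_RE.match(name)
    if match is None:
        return None
    return PREFIXES[match.group("prefix")]


for candidate in ["model/add-xgboost", "doc/update-readme", "feature/login"]:
    print(candidate, "->", branch_type(candidate))
```

A check like this could run in a pre-push hook or a CI job so that branches outside the convention are caught early.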
The described branches are managed as follows:
- If an issue is detected in the main branch, a hotfix branch is created from the main branch and merged back into it once the fix is complete.
- The test branches are created from the main branch and merged back into it once they are complete.
- The feature branches (document, configuration, data, model and visualization) are created from the main branch and merged back into it once they are complete.
In summary, all the branches are created from the main branch and merged into it when completed. You can add other branches for additional purposes or set a branch for each document type. However, I recommend keeping the number of branches small and giving each one a meaningful role. The benefit of this approach lies in its flexibility: you can easily switch to a more elaborate workflow, such as the Gitflow workflow with dev and release branches, whenever the need arises.
Conclusion
That brings us to the end of this article! I shared with you how I manage my AI projects using my personal version control workflow. How do you feel about this approach? Do you have your own workflow that you find effective? Please feel free to discuss it in the comments!
My aim through my articles is to provide my readers with clear, well-organized and easy-to-follow tutorials that offer a solid introduction to the diverse topics I cover and promote good coding and reasoning skills. I am on a never-ending journey of self-improvement, and I share my findings with you through these articles. I frequently refer to my own articles as resources when needed.
Image credits
All images and figures in this article whose source is not mentioned in the caption are by the author.