How I Set Up Data Science Projects (with VS Code and DVC)

Introduction

Setting up a development environment is usually the first thing people do when starting any coding project. An effective development environment can be a huge productivity boost that helps us produce high-quality work. In data science, however, this process is far more ambiguous than in fields like software development, because data science comes with its own unique challenges.

In this article, I'm going to explain how I set up the working environment for a data science project, the motivation behind it, and the tools I use to build an environment that works for me and my team.

Photo by Remy_Loz on Unsplash

Motivation

When I was still working as a software engineer, the first thing I did when joining a team was setting up a development environment. It usually consisted of tasks such as checking out code repositories, installing required libraries and frameworks, and getting familiar with third-party tools. Once finished, you had a ready-to-use environment for working with the team. But when I made the transition to data science, no such process existed at my company. Data scientists worked on cloud-based notebook services, the code wasn't version controlled, and the data was uploaded to cloud storage without any documentation. Settling into the team was complete chaos, let alone collaborating with others.

To make it less painful, I started designing a standard working environment for the team, using the one from software development as an inspiration. However, data science is different from software development, so simply copying and pasting wouldn't be sufficient. Data science has its own challenges, and the setup needs some modifications. Here are three additional requirements I considered when designing the data science working environment:

  1. Code and Data Tracking: Unlike software projects, data science projects are the product not just of code but also of data. An effective data science working environment needs version control for both code and data.

  2. Seamless Remote Development Experience: Processing data beyond the capacity of an ordinary machine is typical in data science projects, which is why people prefer running code on a remote workstation. The environment we're designing should make the remote development experience as seamless as possible.

  3. Easy-to-use Experiment Tracking: Data science work is experimental. Data scientists can spend a whole day changing hyperparameters back and forth to compare their effect on the model. An easy-to-use experiment tracking facility is a must-have.

Now that we have clear requirements, the next step is to find the tools that enable them.

Tools

Git + DVC

Starting with project tracking: Git has been the universal standard for version control in software development for years, so for code there is really only one option. Data versioning, however, is trickier. Git is designed for small, text-based files, while most of our ML projects contain huge amounts of unstructured data. We explored two options: Git LFS and DVC. We ended up using DVC because its features are data-science-centric.

Git and DVC for tracking code and data (Image by author).
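
To make this concrete, here is a minimal sketch of what tracking data with DVC looks like. The remote name, bucket, and file paths are hypothetical placeholders; the key point is that Git only ever sees a small .dvc pointer file, while the data itself lives in remote storage.

```bash
# Minimal sketch of versioning data next to code (bucket and paths are hypothetical).
pip install "dvc[s3]"
dvc init                               # creates .dvc/ config files, tracked by Git
dvc remote add -d storage s3://my-dvc-bucket
dvc add data/raw/train.csv             # writes a small data/raw/train.csv.dvc pointer file
git add data/raw/train.csv.dvc data/raw/.gitignore .dvc
git commit -m "Track raw training data with DVC"
dvc push                               # uploads the actual data to the remote
```

A teammate can then run git pull followed by dvc pull to get both the code and the exact version of the data it was written against.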

Visual Studio Code

As mentioned earlier, ML projects usually require computing power beyond what a local machine can offer, which is why ML practitioners use cloud-based notebook services like Google Colab, SageMaker, and Vertex AI. But Jupyter notebooks lack the coding assistance features of modern IDEs, and those features, used right, can be a huge productivity boost. Therefore, we migrated from Jupyter notebooks to Visual Studio Code.

For computing resources, we use VS Code's remote development feature to connect to the code and data repository on a remote cloud server. This way, we don't have to choose between high computing power and a good developer experience; we get the best of both. More detail about VS Code Remote Development can be found here.

Coding in a Jupyter notebook vs. coding in VS Code (Image by author).
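
For anyone setting this up, here is a rough sketch of the Remote - SSH configuration. The host alias, address, and key path below are hypothetical placeholders.

```bash
# Sketch: register the remote workstation under a short alias (all values hypothetical).
cat >> ~/.ssh/config <<'EOF'
Host ml-workstation
    HostName ec2-203-0-113-10.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/ml-workstation.pem
EOF

ssh ml-workstation 'echo connected'    # sanity-check before pointing VS Code at it
# In VS Code: run "Remote-SSH: Connect to Host..." and pick ml-workstation.
```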

DVC VS Code Extension

Another key aspect of the data science workflow is experiment tracking. Data scientists run ML experiments over and over with different parameters before arriving at the final model, and tracking those experiments by hand is extremely exhausting. There are plenty of experiment tracking tools, such as TensorBoard, MLflow, and WandB, to name a few. Our concern with them is that they are either separate services that need to be hosted or third-party SaaS products that require sending data to their APIs.

The option we ended up with is DVC metrics, a feature of DVC. It lets you save metrics to a JSON or YAML file (which we already did before adopting DVC) and mark it as a metrics file. DVC then tracks the file like any other data, except that you can plot, visualize, and compare its contents using the CLI. The experiment report generated by the CLI is significantly less user-friendly than those provided by TensorBoard and MLflow. Luckily, DVC provides a VS Code extension that mitigates this problem to a level we find acceptable.
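
As an illustration, here is a sketch of how a metrics file might be registered with DVC. The stage, script, and file names are hypothetical; the important part is listing metrics.json under metrics: in dvc.yaml so DVC knows to plot and compare it across commits.

```bash
# Sketch: declare a pipeline stage whose metrics file DVC tracks and compares
# (stage, script, and file names are hypothetical).
cat > dvc.yaml <<'EOF'
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/raw/train.csv
    metrics:
      - metrics.json:
          cache: false      # small text file; keep it in Git rather than the DVC cache
EOF

dvc repro                   # runs the stage; train.py is expected to write metrics.json
dvc metrics show            # print the current metric values
dvc metrics diff main       # compare the workspace against the main branch
```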

One limitation we found in using DVC to manage experiments shows up when multiple engineers work on the same model simultaneously and want to compare their results: because metrics are tracked in Git, they have to check out each other's branches to do so. Since we have a small team, this situation doesn't happen often, so it's a limitation we can live with.

DVC VS Code Extension (Image from the VS Code Extension Marketplace).

Workflow

The following is the process we use when developing a data science project.

  1. Creating a Git and DVC repository: We start by creating a Git repository for the code using the Cookiecutter Data Science project template. Having a standard project template helps developers work together more smoothly because the project already has a clear structure. Once the code setup is finished, we add the data to the project using DVC and push it to our remote storage so that the team can access it (see the first sketch after this list).

  2. Setting up a remote development environment: Next, we set up a development server on which data scientists run their experiments. We launch an EC2 instance, log into the server, check out the code from Git, and pull the data from remote storage using DVC.

  3. Implement and experiment: Now that the development environment is ready, we can start working. Since we use VS Code for development, there are two extensions to install: Remote - SSH and DVC. The first is used to connect remotely to the environment created in the previous step so we can run code on a high-performance machine; the DVC extension is used mainly as an experiment tracking tool. Once they are installed and configured, data scientists can start writing code, running experiments, and tuning parameters (see the second sketch after this list).

  4. Open a merge request and do a code review: Once data scientists get a model they're satisfied with, they commit all the changes, open a merge request on the Git repository, and ask for a code review. When the request is raised, our CI/CD pipeline creates a report comparing the metrics of the new branch against the target branch, so reviewers can see how the new changes impact the model before approving or rejecting the request (check this article if you're interested in this part).
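
For reference, here is a rough command-line sketch of steps 1 and 2. Every name below (project, bucket, repository, host) is a hypothetical placeholder, and the template URL is worth confirming in the Cookiecutter Data Science documentation.

```bash
# Step 1 (sketch): scaffold the project and version the data (all names hypothetical).
pip install cookiecutter
cookiecutter https://github.com/drivendata/cookiecutter-data-science  # confirm the current URL
cd my-ml-project
git init && dvc init
dvc remote add -d storage s3://my-dvc-bucket
dvc add data/raw
git add . && git commit -m "Initial project structure and data"
dvc push

# Step 2 (sketch): on the EC2 workstation, reproduce the same environment.
git clone git@github.com:our-org/my-ml-project.git && cd my-ml-project
pip install -r requirements.txt
dvc pull                    # fetches the data version referenced by the checked-out commit
```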
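
And a sketch of steps 3 and 4. The extension identifiers are the ones listed on the VS Code Marketplace (worth double-checking there), and the CI command assumes Git history and DVC are available on the runner.

```bash
# Step 3 (sketch): install the two VS Code extensions from the command line.
code --install-extension ms-vscode-remote.remote-ssh
code --install-extension iterative.dvc

# Step 4 (sketch): in CI, compare the merge request's metrics against the target branch.
git fetch origin main
dvc metrics diff origin/main > metrics_report.txt   # post this report on the merge request
```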

Overview of the workflow (Image by author).

Conclusion

That's it. This is how I set up a development environment for a data science project using Git, VS Code, and DVC. Once my team adopted this standard, onboarding new members became much smoother because there is no ambiguity about the tools and processes a new member should use. Moreover, collaboration between team members became more effective because everyone is speaking the same language.


Originally published at https://thanakornp.com.
