Why Data Scientists Need These Software Engineering Skills


The role of the data scientist is changing. Businesses no longer want proof-of-concept (PoC) models that sit in Jupyter notebooks, because a model that never reaches production provides no value. That's why, as data scientists, we should upskill in software engineering to deploy our algorithms better. In this article, I want to break down the essential software engineering skills you need to learn as a data scientist.

System Design

When building large-scale applications, multiple components are often involved, such as the front-end, database, APIs, and, if it's an algorithm product, the machine learning model itself.

Key concepts like caching, load balancing, the CAP theorem, scalability, etc., must be considered to build the best system possible for the particular scenario.
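To make one of these concrete: caching means storing the results of expensive operations so that repeat requests are served quickly. Below is a minimal Python sketch of the idea; the feature lookup, the toy model, and all the names are hypothetical stand-ins, not a real production design.

```python
import time
from functools import lru_cache

def fetch_features(user_id: int) -> tuple[float, float]:
    """Stand-in for an expensive feature-store or database lookup."""
    time.sleep(0.5)  # simulate network/disk latency
    return (user_id * 0.1, user_id * 0.2)

@lru_cache(maxsize=1024)  # keep up to 1,024 recent results in memory
def predict(user_id: int) -> float:
    """Repeat calls with the same user_id skip the slow lookup entirely."""
    x1, x2 = fetch_features(user_id)
    return 0.7 * x1 + 0.3 * x2  # placeholder for a real model's prediction

start = time.perf_counter()
predict(42)  # cache miss: pays the full lookup cost
print(f"first call:  {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
predict(42)  # cache hit: returns almost instantly
print(f"second call: {time.perf_counter() - start:.3f}s")
```

In a real system, the cache would more likely be a shared service such as Redis rather than in-process memory, but the trade-off is the same: faster responses in exchange for potentially stale results.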

System design is important for data scientists because it helps us understand how the model will be used in production and ensures we build it in the most appropriate way for that system.

We want our model to go into production as smoothly as possible, and understanding the whole architecture helps tremendously with this.

Of course, you can take courses or watch videos online to learn system design. However, what I found best is to sit down with some software engineers before you build your machine learning algorithm and discuss how it could go into production.

This will give you hands-on experience and let you tap into the expertise of software engineers who have probably been doing this sort of thing for years.

If you do want to take a system design course, I recommend NeetCode's.

NeetCode

Shell & Bash Scripting

Honestly, any tech professional should be competent in using the command line, as it's so helpful. Navigating the terminal and carrying out basic commands are almost necessities nowadays.

I am not saying you need to be some Vim or Nano whiz, but you should know the basics well and understand how files are organised in UNIX systems, because most servers and cloud providers are Linux-based.

The reason you should know Bash or Zsh well is that when it comes to using tools like Docker, Kubernetes, Git, or any cloud provider's CLI locally, proficiency in the command line is a must.

I promise you that at some point in your data science career, you will need to use the command line, so you might as well learn some of it now to prepare for that occasion.

I have a separate article detailing the command line and shell basics that you can check out below.

What is the Shell in Linux?

Testing

For some reason, data scientists are not taught to write good, well-tested code. Instead, they focus on implementing machine learning models and doing exploratory data analysis – basically, the fun stuff.

I totally understand why this is the case; it's a great way to introduce someone to the field. However, it is highly desirable to implement your algorithms to proper production-code standards, as that's what generates business value, which is your primary goal as a data scientist. You are paid to be a net positive for the company.

A crucial component of this is being able to test your code. This includes writing unit tests, integration tests, and end-to-end tests, and running them in CI/CD pipelines. Writing tests may not sound that fun, but tests are vital because they ensure your code does exactly what you want it to do.

I can't tell you how many times I wrote a function or class that I felt was top-tier, only to realise I made several errors after conducting some basic tests. I wouldn't say I liked all these processes initially, but over time, I started to enjoy them and see their value.

There is even a whole paradigm of writing code called test-driven development (TDD). The entire premise is to write a test that fails, write just enough code to make the test pass, and finally refactor it to a high standard.
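To make this concrete, here is a minimal sketch of one TDD loop with pytest. The module, function, and behaviour are hypothetical examples, not from a real project: you would write the tests first, watch them fail, then implement `normalise` until they pass.

```python
# test_preprocessing.py - step 1: write tests that fail
# Run with: pytest test_preprocessing.py
from preprocessing import normalise

def test_scales_values_to_unit_range():
    assert normalise([0.0, 5.0, 10.0]) == [0.0, 0.5, 1.0]

def test_handles_constant_input():
    # Edge case: all values are equal, so avoid dividing by zero
    assert normalise([3.0, 3.0, 3.0]) == [0.0, 0.0, 0.0]
```

```python
# preprocessing.py - step 2: write just enough code to pass
def normalise(values: list[float]) -> list[float]:
    """Scale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # constant input: no range to scale
    return [(v - lo) / (hi - lo) for v in values]
```

Step 3 is refactoring while keeping the tests green; from then on, the tests act as a safety net for every future change.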

I have a separate article on how to do unit testing through pytest that you can check out below.

Pytest Tutorial: An Introduction To Unit Testing

Cloud Systems

AWS, Azure, and GCP are now used by virtually every company. They are the de facto way to store data and deploy many applications and systems.

According to a study, these three providers accounted for 66% of the cloud market in 2023, with AWS alone taking half of that (33% of the total).

Given the widespread adoption of cloud technology, it is crucial for data scientists to grasp the basics of how these platforms function. While you don't need to become a cloud engineer, having a fundamental understanding is highly practical.

AWS is the most popular, with over 1.3 million companies using it, including top firms like Disney, Netflix, Airbnb, Meta, and LinkedIn.

I recommend you have a basic understanding of the following as a data scientist:

  • S3 and general file storage
  • ECS and EC2 instances
  • Step Functions and Lambda functions
  • Athena and other database tools

There is so much to know in this space that being an AWS cloud engineer is literally some people's entire job. You just need to know how to store data and deploy code on these systems; everything else will come with time.
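As a starting point, here is a minimal sketch of the bread-and-butter S3 operations using boto3. It assumes boto3 is installed, your AWS credentials are already configured, and `my-team-bucket` is a hypothetical bucket name:

```python
import boto3

# Assumes credentials are configured, e.g. via `aws configure` or an IAM role
s3 = boto3.client("s3")

# Upload a local file to the (hypothetical) bucket
s3.upload_file("model.pkl", "my-team-bucket", "models/model.pkl")

# Download it back to a local path
s3.download_file("my-team-bucket", "models/model.pkl", "model_copy.pkl")

# List everything stored under a prefix
response = s3.list_objects_v2(Bucket="my-team-bucket", Prefix="models/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```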

Typing, Formatting & Linting

Testing is not the only thing you need to write high-quality production code. Typing, formatting, and linting are equally important to maintaining good standards and reducing the chance of errors and bugs.

Let's explain what these things are:

  • Typing – This refers to specifying the datatypes of our variables and the return types of our functions. Python is inherently a dynamically typed language, so there is no formal requirement to declare what datatype a variable takes on. However, adding type hints greatly helps with readability and debugging.
  • Formatting – Formatters are your code's best friend. They automatically restyle your code to align with style guides, streamlining your workflow and making your code more readable and understandable for others.
  • Linting – Linters are tools that catch minor bugs, formatting errors, and odd design patterns that can lead to runtime problems and unexpected outputs.

Python contains many packages and tools that help with these processes and are actually very easy to apply. Some of my favourites are mypy, black, ruff, and isort, which between them help enforce PEP 8 style.
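As a quick illustration of typing, here is a small, hypothetical function with type hints. Plain Python would only fail at runtime when the bad call is actually hit, but running `mypy` on the file reports the problem before the code ever executes:

```python
def train_test_ratio(n_train: int, n_test: int) -> float:
    """Return the fraction of samples used for training."""
    return n_train / (n_train + n_test)

ratio = train_test_ratio(8_000, 2_000)  # OK: 0.8

# The call below would raise a TypeError at runtime, but
# `mypy this_file.py` flags it statically, before anything runs:
# train_test_ratio("8000", 2_000)  # error: argument 1 has type "str"; expected "int"
```

Formatting and linting are similarly one command each: `black your_file.py` rewrites the file in place to a consistent style, and `ruff check your_file.py` reports lint errors.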

If you want a tutorial on how to apply these tools, check out my previous articles below.

A Data Scientist's Guide To Improving Python Code Quality

A Data Scientist's Guide to Python Typing: Boosting Code Clarity

Summary & Further Thoughts

I hope this article gave you insight into the critical software engineering knowledge areas you need to be a well-rounded data scientist. You obviously can't learn these things overnight, but slowly getting comfortable with everything I listed above will stand you in good stead for your career.

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips and advice as a practising data scientist. Plus, when you subscribe, you will get my FREE data science resume and short PDF version of my AI roadmap!

Dishing The Data | Egor Howell | Substack
