Why Data Processing Is a Vital Skill for Data Scientists
Opinion

When I pivoted to a career in data science after completing my PhD in physics, I was fascinated by complex models and insightful data analysis.
Four years later, I know that these aspects of the job carry less and less weight.
The number one ingredient for successful AI applications is data. And the number one time sink in any of my projects is Data Processing.
Welcome to the real world
Introductory courses focus on model development and understanding the inner workings of training a neural network.
We learn to write our own training loop, choose the right validation metric, and understand the bias-variance trade-off.
Students work with readily available datasets like MNIST. Our courses gloss over the data aspect, and indeed for tutorials data processing is as simple as typing
from torchvision.datasets import MNIST
dataset = MNIST('./data', download=True)
But the moment you enter the real world, your datasets are no longer available through one line of code.
Difficulties with data
Even finding the data you need to get started can be a daunting task. Data may be scattered across multiple databases that are difficult to reconcile. Data access may be regulated, and the overall sample count can be low.
Data formats vary a lot depending on the domain. I work with environmental data, which is often stored in the NetCDF format. Before I can apply a standard machine learning pipeline, I first have to adapt my data.
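As a sketch of what this adaptation can look like: with xarray (a common Python tool for NetCDF-style data), a gridded variable can be flattened into the (samples, features) layout that standard ML tooling expects. The dataset below is a toy stand-in built in memory; in practice you would open a real file with xr.open_dataset.

```python
import numpy as np
import xarray as xr

# Toy stand-in for a NetCDF dataset: one gridded variable on (time, lat, lon).
# In practice: ds = xr.open_dataset("my_data.nc")
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(10, 4, 5))}
)

# Flatten each time step into a feature vector: (10 samples, 4*5 = 20 features),
# the shape a standard ML pipeline (e.g., scikit-learn) expects.
X = ds["temperature"].values.reshape(ds.sizes["time"], -1)
print(X.shape)
```

The variable name and dimensions are invented for illustration; the point is only the reshape from gridded to tabular.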
Data access is a major performance issue. You need to transfer the data to where your AI model training and inference takes place, and use a format suitable for random access during training. This can be slow and expensive. I would guess that 99% of my performance problems come from here.
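One mitigation is to convert the data, once, into a format built for cheap random access during training. Below is a minimal sketch using NumPy memory-mapped files; the array shapes are invented for illustration, and real pipelines often reach for purpose-built formats (Zarr, WebDataset, etc.) instead.

```python
import os
import tempfile

import numpy as np

# Pretend this is a dataset that arrived in an awkward format:
# 1000 samples of 12-channel 64x64 imagery.
data = np.random.rand(1000, 12, 64, 64).astype(np.float32)

# One sequential write into a memory-mapped .npy file on disk.
path = os.path.join(tempfile.mkdtemp(), "train.npy")
mmap = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float32, shape=data.shape
)
mmap[:] = data
mmap.flush()

# At training time, reading a random batch touches only those samples on disk,
# instead of loading the whole array into RAM.
batch = np.load(path, mmap_mode="r")[[3, 17, 256]]
print(batch.shape)
```

The payoff is that the expensive conversion happens once, while every training epoch afterwards gets fast random reads.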

Data quality and expertise
Everyone agrees that data quality is important, but few organizations are willing to invest in it.
High-quality data requires documentation, metadata, and adherence to protocols. Research datasets should be FAIR: findable, accessible, interoperable, and reusable.
Working with data requires domain expertise. For example, I often work with satellite data that has 12 channels instead of the familiar 3 RGB channels. I need to know the meaning of each of these channels and adapt standard computer vision processing techniques.
Data scientists and engineers
Beyond data scientists, you will find additional roles in your organization: domain experts, who understand the data deeply, and data engineers, who build the data pipelines.
I work in a small team, and we do not have this clear division. This may be different in larger companies.
The roles of data engineer and data scientist are closely entangled. Data scientists need to formulate clear expectations for the data pipeline, and there is a lot of back and forth before its performance is optimal.
On the other hand, data scientists need to stay in touch with domain experts. Only the domain experts understand the nature of the data, are familiar with its format and characteristics, and can help the data team get the most out of it.
Data and LLMs
As LLMs become more and more proficient at producing code and analyzing data, will the data scientist role soon be outdated? Indeed, GPT-4 can complete the analysis of a dataset with little to no human supervision.
But show me the LLM that can bring three departments to the table and get them to agree on a common way to structure data across the company. The complexity of the data pipeline still requires human oversight.
In my opinion, the communicative aspects of the data scientist role are becoming more important with the rise of LLMs.
Models
With data processing taking up most of my time, where does that leave model development?
In my experience, the model architectures are becoming less and less important to the data scientist. I often use pretrained models from HuggingFace that I finetune for my tasks.
When I need to train from scratch, I usually find a useful architecture on GitHub.
With pretrained models taking over, the only thing that really sets you apart from competitors is your unique data.
Remarks
Although the amount of time I spend data wrangling is larger than I expected, I have found that I like this aspect of the data scientist role. It forces me to be creative, stay on top of data processing technology, and involves a lot of communication.
I wish that data science curricula would place more emphasis on data processing. Simple data loading routines fail when they come into contact with reality, and it is difficult to find real datasets to experiment with.
What are your views? What aspect of the data scientist role is most often overlooked? Would you like to see more specialization, or are you happy to cover data engineering tasks as well?