How to Build Data Pipelines for Machine Learning


This is the 3rd article in a larger series on Full Stack Data Science (FSDS). In the previous post, I introduced a 5-step project management framework for building machine learning (ML) solutions. While ML may bring to mind fancy algorithms and technologies, the quality of an ML solution is determined by the quality of the available data. This raises the need for data engineering (DE) skills in FSDS. This article will discuss the most critical DE skills in this context and walk through a real-world example.


Full Stack Data Science (FSDS) involves managing and implementing ML solutions end-to-end. Data engineering is critical to this process, i.e., making data readily available for analytics and ML applications [1].

While this can involve a wide range of tasks (e.g., data modeling, designing schema, managing distributed systems), in the context of FSDS, data engineering comes down to one key thing—building data pipelines.

A Data Pipeline gets data from point A to point B. For example, scraping a webpage, reformatting the data, and loading it into a database. Data pipelines consist of 3 steps – extract (E), transform (T), and load (L) – which can be combined in two ways: ETL or ELT.
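
To make this concrete, here is a minimal sketch of what such a pipeline can look like in Python; the URL, database file, and table name are placeholder values, not references to any specific system.

```python
# Minimal ETL sketch: extract a table from a webpage, reformat it, and load it
# into a local SQLite database. The URL and table name are placeholder values.
import sqlite3
import pandas as pd

# Extract: read the first HTML table from a (hypothetical) webpage
df = pd.read_html("https://example.com/some-table")[0]

# Transform: tidy column names and drop duplicate rows
df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
df = df.drop_duplicates()

# Load: write the result to a database table
with sqlite3.connect("pipeline.db") as conn:
    df.to_sql("scraped_table", conn, if_exists="replace", index=False)
```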

ETL vs ELT

While the difference between these two pipeline types may sound subtle, they look very different in practice [2].

ETL (extract, transform, load) involves extracting data from a source, transforming it into a useful form, and storing it in a fixed format (typically a table in a database or data warehouse). This is great for data that can be represented as rows and columns and that serves a narrow set of downstream applications, such as customer-level sales data used to train a propensity model.

ELT (extract, load, transform), on the other hand, extracts data and loads it in its raw format (typically into a data warehouse or data lake). This centralizes an organization's various data sources in a way that provides more flexibility in downstream use cases. Additionally, ELT supports unstructured and semi-structured data (i.e. data not represented as rows and columns).

ETL (for FSDS)

Although ELT is becoming the norm across enterprises [2], full-stack data scientists typically build product-focused data pipelines (i.e. pipelines built for particular use cases). Thus, the ETL paradigm may be a natural choice for most ML projects.

Let's walk through each stage of this type of pipeline.

Extract

The first step of any data pipeline is extracting data from its source. This could be leads data from a CRM, sales data from an e-commerce platform, marketing data from social media, etc. While enterprises can have hundreds (if not thousands) of data sources [2], most ML projects will involve only a few key data sources.

For most businesses, the bulk of data sources run through 3rd-party platforms, which can (ideally) be accessed via official APIs, e.g., HubSpot's API, Shopify's API, and Meta's APIs.

However, data from some sources may not be readily available and require custom software to extract the necessary information. This may involve scraping public web pages, gathering data from a self-hosted system, collecting raw sensor data, or extracting documents from a file system.
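
As a rough illustration of the extract step, pulling records from a 3rd-party REST API often looks something like the sketch below. The endpoint, auth header, and pagination scheme are hypothetical; real platforms (e.g., HubSpot, Shopify, Meta) each have their own conventions and official client libraries worth using instead.

```python
# Sketch of extracting paginated records from a hypothetical REST API.
# The endpoint, auth header, and pagination parameters are placeholders.
import requests

API_URL = "https://api.example.com/v1/orders"   # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

def extract_orders(page_size: int = 100) -> list[dict]:
    """Page through the API and return all records as a list of dicts."""
    records, page = [], 1
    while True:
        resp = requests.get(
            API_URL,
            headers=HEADERS,
            params={"page": page, "limit": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records
```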

Transform

At extraction, data are typically semi-structured (e.g., JSON, .csv) or unstructured (documents, images, or other binary files). The transform step aims to translate these data into a format suitable for downstream tasks.

This can consist of a wide range of operations. Some examples are given below, followed by a short illustrative sketch.

  • Managing data types and ranges
  • Deduplication
  • Imputing missing values
  • Handling special characters and values
  • Feature engineering
  • Converting data into a structured format (i.e., a table for a database or data warehouse).
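
Here is a minimal pandas sketch covering several of the operations above; the column names and imputation choices are hypothetical and only meant to illustrate the pattern.

```python
# Sketch of common transform operations on a raw extract, using pandas.
# Column names ("order_date", "revenue", "country") are hypothetical.
import pandas as pd

def transform(df_raw: pd.DataFrame) -> pd.DataFrame:
    df = df_raw.copy()

    # Manage data types and ranges
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce").clip(lower=0)

    # Deduplication
    df = df.drop_duplicates()

    # Impute missing values
    df["revenue"] = df["revenue"].fillna(df["revenue"].median())

    # Handle special characters and values
    df["country"] = df["country"].str.strip().str.upper()

    # Feature engineering
    df["order_month"] = df["order_date"].dt.to_period("M").astype(str)

    return df
```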

As the list above suggests, the transform step requires full-stack data scientists to wear their data engineer and data scientist hats. Additionally, this step may be revisited multiple times during an ML project, as new insights are gained during model development.

Load

Once transformed, data must be made readily available for downstream tasks. Data warehouses and data lakes have become standard across enterprises because they are built to accommodate petabyte-scale data needs.

However, many ML systems can be developed using GB-scale data. For example, an FP64 array with 1M rows and 1K columns requires about 8 GB of storage (10^6 × 10^3 × 8 bytes). In these cases, a relational database (e.g., MySQL, PostgreSQL) or a simpler data store (e.g., S3, Google Drive, Dropbox, a system's file directory) may be sufficient.

While the best storage solution will depend on the details of the specific use case, I believe in keeping things as simple as possible and only seeking more sophisticated solutions once things start to break. Here's a coarse-grained guide I like to use as a starting point.

  • MB-scale, few sources: Keep it in the project directory
  • GB-scale, few sources: Simple storage (e.g. S3, Google Drive, Dropbox)
  • GB-scale, many sources: Relational database (e.g. MySQL, PostgreSQL)
  • TB-scale, many sources: Data Warehouse (e.g. Amazon Redshift, Google BigQuery, Snowflake)
  • PB-scale, many sources: Data Lake (e.g. Databricks Delta Lake, Snowflake, AWS Lake Formation)
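
For the MB- to GB-scale cases above, the load step can be as simple as writing a DataFrame to a file or a relational database. Below is a minimal sketch using SQLite; the database file and table name are placeholders, and for MySQL or PostgreSQL you would pass a SQLAlchemy engine instead.

```python
# Sketch of the load step for GB-scale data: write a transformed DataFrame to
# a relational database. The database file and table name are placeholders;
# for MySQL/PostgreSQL, pass a SQLAlchemy engine/connection instead.
import sqlite3
import pandas as pd

def load(df: pd.DataFrame, db_path: str = "warehouse.db") -> None:
    """Append the DataFrame to a table, creating it if it does not exist."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("sales_features", conn, if_exists="append", index=False)
```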

Orchestration

As data sources and processes grow, implementing and managing data pipelines becomes challenging. This is where orchestration tools (like Airflow) are helpful.

Airflow is popular among data engineers because it provides a fully Python-based way to schedule and manage batch workflows. This means data pipelines can be set to run at specific time intervals or upon specific events.
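
As a minimal sketch (assuming a recent Airflow 2.x release where the schedule argument and the airflow.operators.python import path are available), an ETL pipeline scheduled to run daily might look like this; the DAG ID and task functions are placeholders for your own extract, transform, and load logic.

```python
# Minimal Airflow DAG sketch: run an ETL pipeline once per day.
# Assumes Airflow 2.x; the dag_id and task functions are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...    # pull raw data from the source
def transform(): ...  # clean and reshape the data
def load(): ...       # write the result to storage

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # run once per day (use schedule_interval on older 2.x)
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load  # task order: E -> T -> L
```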

A DAG for a simple data pipeline. Image by author.

Data Management

Once a data pipeline moves out of the development environment, it's important to implement processes for managing and monitoring the data effectively. While the degree of management will depend on the volume and complexity of the data pipeline, here are a few key things to consider.

  • Data Dictionaries: Human-readable descriptions of the data at both table and column levels (or folder and file levels); a small example follows this list.
  • Data Versioning & Backups: Keep track of multiple versions of data and ensure proper backups. This comes built-in with many data platforms.
  • Data Observability: Setting up monitoring systems to track performance and send alerts when issues arise.
  • Data Governance: Defining rules around data to ensure quality, security, and compliance, as well as to track lineage and define ownership. This becomes critical when juggling multiple use cases and data sources.
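
As a small example of the first item above, a data dictionary can be as lightweight as a version-controlled mapping of table and column names to plain-English descriptions. The table and column names below are hypothetical.

```python
# Minimal example of a column-level data dictionary for one table.
# Table and column names are hypothetical.
DATA_DICTIONARY = {
    "table": "video_transcripts",
    "description": "One row per video with its auto-generated captions.",
    "columns": {
        "video_id": "Unique video identifier (string, primary key).",
        "datetime": "Publication date and time of the video (UTC).",
        "title": "Video title as shown on the platform.",
        "transcript": "Full auto-generated caption text for the video.",
    },
}
```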

Example Code: ETL of YouTube Video Transcripts

Now that we have a basic understanding of data pipelines, let's walk through a simple yet real-world example of building one. Here, I build on the example case study from the previous post and implement an ETL pipeline to curate the automatically generated captions from all my YouTube videos.
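
As a taste of the extract step, one way to pull a video's auto-generated captions is with the open-source youtube-transcript-api package (an assumption here, not necessarily the exact tooling used in the full example); the video ID below is a placeholder, and the call signature may differ between package versions.

```python
# Sketch: extract the auto-generated transcript for one video and flatten it
# into a single text field. Assumes the third-party youtube-transcript-api
# package (pip install youtube-transcript-api); older releases expose
# get_transcript(), while newer ones use an instance-based fetch() method.
from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript_text(video_id: str) -> str:
    """Return the full caption text for one video as a single string."""
    # Each segment is a dict with "text", "start", and "duration" keys
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(seg["text"] for seg in segments)

if __name__ == "__main__":
    print(get_transcript_text("VIDEO_ID_HERE")[:500])  # placeholder video ID
```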
