Build Deployable Machine Learning Pipelines

Background – Notebooks are not "Deployable"
Many data scientists' initial encounters with coding take place through notebook-style user interfaces. Notebooks are indispensable for exploration – a critical aspect of our workflow. However, they're not designed to be production-ready. This is a key issue I've observed among numerous clients, some of whom inquire about ways to productionise their notebooks. Rather than productionising your notebooks, the optimal route to production readiness is to craft modular, maintainable, and reproducible code.
In this article, I present an example of a modular ML pipeline for training a model to classify fraudulent credit card transactions. By the conclusion of this article, I hope that you will:
- Gain an appreciation and understanding of modular ML pipelines.
- Feel inspired to build one for yourself.
If you want to reap the full benefits of deploying your machine learning models, writing modular code is an important step to take.
First, a quick definition: modular code is a software design paradigm that emphasises separating a program into independent, interchangeable modules. We should aim to approach this state with our machine learning pipelines.
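To make that concrete, here is a minimal, hypothetical sketch of what "modular" looks like in practice: logic that would normally live in one long notebook cell is split into small, single-purpose functions that can be tested, reused, and swapped independently. The function names and split settings are illustrative, not taken from the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def load_transactions(path: str) -> pd.DataFrame:
    """Read the raw transaction data from disk."""
    return pd.read_csv(path)


def split_data(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Split the data into training and holdout sets."""
    return train_test_split(df, test_size=test_size, random_state=seed)
```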
Quick Detour – The Project, the Data, and the Approach
The machine learning project is sourced from Kaggle. The dataset consists of 284,807 anonymised credit card transactions, of which 492 are fraudulent. The task is to build a classifier to detect the fraudulent transactions.
The data for this project is licensed for any purpose, including commercial use, under the Open Data Commons licence.
I have used a deep learning approach leveraging Ludwig, an open-source framework for declarative deep learning. I won't go into the details of Ludwig here; however, I have previously written an article on the framework.
The Ludwig deep neural network is configured with a .yaml file. For those who are curious, you can find this in the model registry on GitHub.
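The actual configuration is in the repo, but to give a flavour of the declarative approach, below is a hedged sketch of what a Ludwig config for this dataset might look like. The feature list and trainer settings are illustrative assumptions, not the settings used in the project.

```yaml
# model.yaml - illustrative Ludwig configuration, not the repo's actual config
input_features:
  - name: V1        # anonymised PCA features from the Kaggle dataset
    type: number
  - name: V2
    type: number
  - name: Amount
    type: number
output_features:
  - name: Class     # 1 = fraudulent, 0 = legitimate
    type: binary
trainer:            # named `training` in older Ludwig releases
  epochs: 20
  learning_rate: 0.001
```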
Building Modular Pipelines with Kedro
Building modular machine learning pipelines has been made easier with open-source tools; my favourite of these is Kedro, not only because I have seen it used successfully in industry, but also because it helped me develop my software engineering skills.
Kedro is an open-source framework (licensed under Apache 2.0) for creating reproducible, maintainable and modular Data Science code. I came across it while I was developing the AI strategy for a bank, considering which tools my team could utilise to build production-ready code.
Disclaimer: I have no affiliation with Kedro or McKinsey's QuantumBlack, the creators of this open-source tool.
The Model Training Pipeline

Kedro conveniently allows you to visualise your pipelines, a fantastic feature that can help bring clarity to your code. The pipeline is standard for machine learning, so I will only briefly touch on each aspect.
- Import Data: Import the credit card transaction data from an external source.
- Split Data: Randomly split the data into training and holdout sets.
- Run Experiment: Uses the Ludwig framework to train a deep neural network on the training dataset. The Ludwig experiment API conveniently saves model artefacts for every experiment run (a rough sketch of this step and the prediction step follows the list).
- Run Predictions: Uses the model trained in the previous step to run predictions on the holdout dataset.
- Model Diagnostics: Produces two diagnostic charts. First, a chart tracking the cross-entropy loss over each epoch. Second, the ROC curve measuring the model's performance on the holdout dataset.
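As promised above, here is a rough sketch of how the experiment and prediction steps could be written with Ludwig's Python API. It is a simplified illustration under assumed function names; the real project drives this through the configuration described earlier.

```python
import pandas as pd
from ludwig.api import LudwigModel


def run_experiment(train_df: pd.DataFrame, model_config_path: str) -> LudwigModel:
    """Train a Ludwig deep neural network on the training split (simplified)."""
    model = LudwigModel(config=model_config_path)  # config is the model .yaml file
    model.train(dataset=train_df)                  # Ludwig writes artefacts to an output directory
    return model


def run_predictions(model: LudwigModel, holdout_df: pd.DataFrame) -> pd.DataFrame:
    """Score the holdout set with the trained model."""
    predictions, _ = model.predict(dataset=holdout_df)
    return predictions
```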


Core Components of the Pipeline
Now that we have established a high-level view, let's get into some of the core components of this pipeline.
Project Structure
C:.
├───conf
│   ├───base
│   │   └───parameters
│   └───local
├───data
│   ├───01_raw
│   ├───02_intermediate
│   ├───03_primary
│   ├───04_feature
│   ├───05_model_input
│   ├───06_models
│   │   ├───experiment_run
│   │   │   └───model
│   │   │       ├───logs
│   │   │       │   ├───test
│   │   │       │   ├───training
│   │   │       │   └───validation
│   │   │       └───training_checkpoints
│   │   └───experiment_run_0
│   │       └───model
│   │           ├───logs
│   │           │   ├───test
│   │           │   ├───training
│   │           │   └───validation
│   │           └───training_checkpoints
│   ├───07_model_output
│   └───08_reporting
├───docs
│   └───source
│
└───src
    ├───fraud_detection_model
    │   └───pipelines
    │       └───train_model
    └───tests
        └───pipelines
Kedro provides a templated directory structure that is established when you initiate a project. From this base, you can programmatically add more pipelines to your directory structure. This standardised structure means that every machine learning project is laid out in the same way and is easy to document, which makes maintenance far more straightforward.
Data Management
Data plays a crucial role in machine learning. The ability to track your data becomes even more essential when employing machine learning models in a commercial setting. You often find yourself facing audits, or the necessity to productionise or reproduce your pipeline on someone else's machine.
Kedro offers two methods for enforcing best practices in data management. The first is a directory structure, designed for machine learning workloads, providing distinct locations for the intermediate tables produced during data transformation and the model artefacts. The second method is the data catalogue. As part of the Kedro workflow, you are required to register datasets within a .yaml configuration file, thereby enabling you to leverage these datasets in your pipelines. This approach may initially seem unusual, but it allows you and others working on your pipeline to track data with ease.
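As an illustration, a couple of catalogue entries might look like the sketch below. The dataset names and file paths are assumptions for this example, and the dataset class is spelled pandas.CSVDataset in newer Kedro releases.

```yaml
# conf/base/catalog.yml - illustrative entries, not the project's actual catalogue
creditcard_transactions:
  type: pandas.CSVDataSet
  filepath: data/01_raw/creditcard.csv

train_data:
  type: pandas.ParquetDataSet
  filepath: data/05_model_input/train.parquet
```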
Orchestration – Nodes and Pipelines
This is really where the magic happens. Kedro provides you with pipeline functionality straight out of the box.
The basic building blocks of your pipeline are nodes. Each executable piece of code can be encapsulated within a node, which is simply a Python function that accepts inputs and yields outputs. A pipeline is then structured as a series of nodes: you construct it by invoking each node and specifying its inputs and outputs, and Kedro determines the execution order.
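A minimal sketch of how this might look for the training pipeline is shown below. The node functions, dataset names, and parameter keys are illustrative assumptions rather than the repo's actual code.

```python
# src/fraud_detection_model/pipelines/train_model/pipeline.py (illustrative sketch)
from kedro.pipeline import Pipeline, node

from .nodes import split_data, run_experiment, run_predictions


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=split_data,
                inputs=["creditcard_transactions", "params:test_size"],
                outputs=["train_data", "holdout_data"],
                name="split_data_node",
            ),
            node(
                func=run_experiment,
                inputs=["train_data", "params:model_config_path"],
                outputs="trained_model",
                name="run_experiment_node",
            ),
            node(
                func=run_predictions,
                inputs=["trained_model", "holdout_data"],
                outputs="holdout_predictions",
                name="run_predictions_node",
            ),
        ]
    )
```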
Once pipelines are constructed, they are registered in the provided pipeline_registry.py file. The beauty of this approach is that you can create multiple pipelines. This is particularly beneficial in machine learning, where you might have a data processing pipeline, a model training pipeline, an inference pipeline, and so forth.
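Registering the pipeline is then a few lines in pipeline_registry.py; a hedged sketch under the same assumed names is shown below.

```python
# src/fraud_detection_model/pipeline_registry.py (illustrative sketch)
from typing import Dict

from kedro.pipeline import Pipeline

from fraud_detection_model.pipelines import train_model


def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines so they can be run by name."""
    train_model_pipeline = train_model.create_pipeline()
    return {
        "__default__": train_model_pipeline,
        "train_model": train_model_pipeline,
    }
```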
Once set up, it's straightforward enough to modify aspects of your pipeline.
Configuration
Kedro's best practices stipulate that all configurations should be handled through the provided parameters.yml file. From a machine learning perspective, hyperparameters fall into this category. This approach streamlines experimentation, as you can simply substitute one parameters.yml file with a set of hyperparameters for another, which is also much easier to track.
I have also included the locations of my Ludwig deep neural network model.yaml and the data source within the parameters.yml configuration. Should the model or the location of the data change – for instance, when moving between developers' machines – it would be incredibly straightforward to adjust these settings.
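To make this concrete, a parameters.yml along these lines is sketched below; the keys and values are illustrative assumptions that match the pipeline sketch above, not the project's actual configuration.

```yaml
# conf/base/parameters.yml - illustrative, not the repo's actual parameters
test_size: 0.2
random_seed: 42

# Locations referenced by the pipeline; update these when moving machines
model_config_path: conf/base/model.yaml
raw_data_path: data/01_raw/creditcard.csv
```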
Reproducibility
Kedro includes a requirements.txt file as part of the templated structure. This makes it really straightforward to monitor your environment and exact library versions. However, should you prefer, you can employ other environment management methods, such as an environment.yml file.
Establishing a Workflow
If you're developing machine learning pipelines and considering using Kedro, it can initially present a steep learning curve, but adopting a standard workflow will simplify the process. Here's my suggested workflow:
- Establish your working environment: I prefer using Anaconda for this task. I typically use an environment.yml file containing all the dependencies needed for my environment, and use the Anaconda PowerShell command line to create my environment from it.
- Create a Kedro project: Once you have Kedro installed – which should hopefully be declared in your environment.yml – you can create a Kedro project through the Anaconda command line interface.
- Explore in Jupyter Notebooks: I construct an initial pipeline in Jupyter notebooks, a process familiar to most data scientists. The only difference is that, once your pipeline is built, you should tidy it up so that each cell could serve as one Node in your Kedro pipeline.
- Register your data: Record the inputs and outputs for each data processing or data ingestion step in the data catalogue.
- Add your pipeline: After conducting your exploration in notebooks, you'll want to create a pipeline. This is achieved through the command line interface. Running this command will add an additional folder to 'pipelines', bearing the name of the pipeline you've just created. It's within this folder that you'll construct your nodes and pipelines.
- Define your pipeline: This is the stage where you start transferring the code from your Jupyter notebooks into the node.py file in your pipeline folder, ensuring that nodes you intend to be part of a pipeline have inputs and outputs. Once the nodes are set up, proceed to define your pipeline in the pipeline.py file.
- Register your pipelines: The pipeline_registry.py file offers a template for you to register your newly created pipeline.
- Run your project: Once everything is established, you can run any pipeline through the CLI and also visualise your project (the key commands are sketched after this list).
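For reference, the commands behind this workflow look roughly like the following; the environment, project, and pipeline names are illustrative, and exact flags can vary between Kedro versions.

```bash
# 1. Create and activate the environment (assumes an environment.yml exists)
conda env create -f environment.yml
conda activate fraud-detection

# 2. Create a new Kedro project from the template
kedro new

# 3. Add a pipeline; this creates a train_model folder under src/<package>/pipelines
kedro pipeline create train_model

# 4. Run the registered pipeline and visualise the project
kedro run --pipeline train_model
kedro viz   # `kedro viz run` in newer versions of Kedro-Viz
```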
Production-ready pipelines fit into a wider ecosystem of machine learning operations. Read my article on MLOps for a deeper dive.
Conclusion
Kedro is an excellent framework for delivering production-ready machine learning pipelines. Beyond the functionality discussed in this article, there are numerous integrations with other open-source libraries, as well as packages for documentation and testing. Kedro doesn't solve every issue related to model deployment – for instance, model versioning is likely better handled by another tool such as DVC. However, it will assist data scientists in a commercial setting to produce more maintainable, modular, and reproducible code that is ready for production. There is a relatively steep learning curve for complete novices, but the documentation is clear and includes guided tutorials. As with any of these packages, the best way to learn is to dive in and experiment.
Link to the full GitHub repo
Follow me on LinkedIn