Simple Model Retraining Automation via GitHub Actions

Author: Murphy

Machine learning models can create immense value for a business. However, developing them isn't a one-time activity; it's a continuous process if the model is to keep providing value. This is where MLOps comes in.

MLOps combines CI/CD principles with machine learning development, with the aim of delivering continual value from the model.

One way for a machine learning model to keep delivering value is to retrain it when necessary, for example when data drift is detected. We can automate this retraining by setting up an environment with the right retraining triggers.

We can use GitHub Actions to facilitate the retraining process. GitHub Actions is GitHub's CI/CD platform for automating software development workflows directly from a GitHub repository.

This article will show how to automate model retraining with GitHub Actions. Let's get into it.


Preparation

For this project, we will walk through a simple model development and retraining automation demo. The overall project structure looks like the chart below.

Image by Author

Let's start by preparing the GitHub repository where GitHub Actions will run. Create an empty repository with your preferred name; I created this repository for the tutorial.

Additionally, we will simulate the model deployment with Docker. Install Docker Desktop, and sign up for Docker Hub if you haven't already.

Then, create a GitHub Personal Access Token (PAT) with the repo and workflow scopes. Keep the token somewhere safe and go back to the empty repository you just created. Open the repository settings and select "Secrets and variables". Next, create repository secrets containing your PAT, Docker username, and Docker password.

Image by Author

Clone the GitHub repository to your local machine or whichever platform you are working on. Once you are ready, prepare the overall structure for the tutorial. In your favorite IDE, create the folders shown below.

diabetes-project/
├── data/
├── notebooks/
├── scripts/
├── models/
└── .github/
    └── workflows/
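If you prefer, you can also create the same skeleton programmatically; a minimal sketch, run from the project root:

# Optional convenience: create the folder skeleton with Python instead of the IDE.
from pathlib import Path

for folder in ["data", "notebooks", "scripts", "models", ".github/workflows"]:
    Path(folder).mkdir(parents=True, exist_ok=True)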

With the folders in place, set up a virtual environment. It's a best practice to work in an isolated environment. Go to your root folder and run the following command in the CLI.

python -m venv your_environment_name

Then, activate the virtual environment by running the command below (shown for Windows; on Linux or macOS, use source your_environment_name/bin/activate instead).

your_environment_name\Scripts\activate

After activating the virtual environment, install all the necessary packages for the tutorial. Create a file called requirements.txt in your root folder and fill it with the following packages.

fastapi
uvicorn
pandas
scikit-learn
matplotlib
seaborn
evidently

With the requirements.txt file ready, install the packages in the virtual environment.

pip install -r requirements.txt

All the preparation is now done, and we can move on to developing the model and setting up the retraining automation.


Model Development

This tutorial uses the open-source Diabetes dataset (CC0: Public Domain) as the example dataset. Download the dataset and put it in the data folder. I renamed the dataset to data.csv, but you can use any name you prefer.

We will perform the initial model development in a Jupyter Notebook. Create your notebook and put it in the notebooks folder. Then, start by reading the dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data_path = '..//data//data.csv'

df = pd.read_csv(data_path)

We won't spend much time on data exploration here, since this article focuses on demonstrating GitHub Actions for retraining automation. I have included the data exploration part in the notebook if you want to look at it.
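As a quick, optional sanity check before modeling, you can inspect the schema and the class balance:

# Quick look at the schema, data types, and class balance before modeling.
df.info()
print(df['Outcome'].value_counts(normalize=True))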

For now, let's move on to data preprocessing and pipeline creation. We use a data pipeline to mirror a standard development process.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = df.drop('Outcome', axis=1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_features = X.columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ])

Once the preprocessing pipeline is ready, we will use the Random Forest algorithm as our machine learning model. You can choose any other model that serves your purpose.

from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)

Let's evaluate the model and see how it performs.

from sklearn.metrics import classification_report

y_pred = pipeline.predict(X_test)

# Evaluate the model
report = classification_report(y_test, y_pred)
print(report)
Image by Author

Overall, the performance is acceptable. It could be better, but let's continue with the current model and save it in the models folder.

import pickle

with open('..//models//pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

With the model ready, we will deploy it to production. We will serve it as an API and use Docker to containerize it.

To deploy the model as an API, create a file called app.py and save it in the scripts folder. Use the following code inside the file to expose the model as an API.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pickle
import pandas as pd

app = FastAPI()

columns = ['Pregnancies', 'Glucose', 'BloodPressure', 
'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

dict_res = {0: 'Not-Diabetes', 1: 'Diabetes'}

pipeline_path = 'models/pipeline.pkl'
with open(pipeline_path, 'rb') as pipeline_file:
    pipeline = pickle.load(pipeline_file)

class DataInput(BaseModel):
    data: list

@app.post("/predict")
async def predict(input_data: DataInput):
    try:
        df = pd.DataFrame(input_data.data, columns=columns)
        predictions = pipeline.predict(df)
        results = [dict_res[pred] for pred in predictions]

        return {"predictions": results}

    except Exception as e:
        print("Error:", str(e))
        raise HTTPException(status_code=400, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Let's test whether we can access the model API. First, run the following command in the CLI to start the application.

uvicorn scripts.app:app --host 0.0.0.0 --port 8000

Then, run the following code in your Jupyter Notebook to test the API.

import requests

url = "http://localhost:8000/predict"

data = {
    "data": [
        [1, 85, 66, 29, 0, 26.6, 0.351, 31]
    ]
}

response = requests.post(url, json=data)
print(response.json())

Make sure the values you pass to the API are in the same column order as the training data.
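For example, here is a minimal sketch (assuming X and X_test from the notebook are still in memory) that builds the payload directly from the DataFrame, so the column order is guaranteed to match:

import requests

# Build the request from the DataFrame itself so the feature order matches training.
feature_order = list(X.columns)
payload = {"data": X_test[feature_order].head(2).values.tolist()}

response = requests.post("http://localhost:8000/predict", json=payload)
print(response.json())

If the API responds correctly, we can move on to building the Docker image and pushing it to Docker Hub.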

First, create a file called Dockerfile in the root folder. Fill it with the following code.

FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

COPY scripts/app.py app.py
COPY models models

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

In the code above, we set up a Python environment and copy the necessary files into the container to run the API, listening on port 8000.

With the Dockerfile ready, we can build the image and push it to Docker Hub with the following commands.

docker build -t username/image_name -f Dockerfile .
docker login -u username
docker push username/image_name:latest

Change username to your Docker Hub username and image_name to your preferred application name. If the push is successful, you should see your image on Docker Hub, like mine.

So, why did we containerize our model API with Docker and push it to Docker Hub? It ensures consistency across all the environments in which you might run the application.

It also sets the stage for the next section, where GitHub Actions retrains the model and pushes an updated image back to this repository; deploying the new model then only requires pulling the image.

Run the commands below to pull the image from Docker Hub and run it in your local environment.

docker login -u username
docker pull username/image_name:latest
docker run -d -p 8000:8000 username/image_name:latest

With this, we have our model in production. In the next part, we will see how to retrain the model with GitHub Actions using certain triggers.


Model Retraining with GitHub Actions

As mentioned, a machine learning model is a continuous project if we want it to keep providing value. We can't expect the model to maintain the same quality over time, especially when drift happens.

In this tutorial, we will learn how to retrain the model automatically when data drift is detected in the production dataset. First, let's see how to detect drift in our dataset.

Let's simulate the drift in a dataset with the following code.

import numpy as np

def introduce_drift(data, drift_features, drift_amount=0.1, random_seed=42):
    np.random.seed(random_seed)
    drifted_data = data.copy()

    for feature in drift_features:
        if feature in data.columns:
            drifted_data[feature] += np.random.normal(loc=0, scale=drift_amount, size=data.shape[0])

    return drifted_data

features_to_drift = ['Glucose', 'BloodPressure', 'SkinThickness', 'Pregnancies']

drifted_data = introduce_drift(X_test, features_to_drift, drift_amount=50)
drifted_data = drifted_data.reset_index(drop = True)

In the code above, we drifted some of the columns in the test data. You can play around with drift_amount to control how much the data drifts.
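As a quick illustrative check of how strong the simulated drift is, you can compare the feature means before and after:

# Illustrative check: compare feature means before and after the simulated drift.
for feature in features_to_drift:
    print(f"{feature}: original mean={X_test[feature].mean():.2f}, "
          f"drifted mean={drifted_data[feature].mean():.2f}")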

For our tutorial, we need the training data (reference) and the drifted data (new). We also save the target column, which we will use later for the retraining example.

reference_data = X_train.reset_index(drop=True)  # use the training set as the reference data
reference_data['Outcome'] = y_train.reset_index(drop = True)
drifted_data['Outcome'] = y_test.reset_index(drop = True)

drifted_data.to_csv('..//data//new_data.csv', index=False)
reference_data.to_csv('..//data//reference_data.csv', index=False)

Using Evidently (I am not affiliated with Evidently in any way), we examine whether the production data has drifted compared to the reference data. We can do that with the following code.

from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

data_drift_report = Report(metrics=[
    DataDriftPreset(),
])

data_drift_report.run(current_data=drifted_data.drop('Outcome', axis =1), 
reference_data=reference_data.drop('Outcome', axis =1), column_mapping=None)
report_json = data_drift_report.as_dict()
drift_detected = report_json['metrics'][0]['result']['dataset_drift']
Image by Author

The result shows that dataset drift is detected, which is what we will use later to trigger the retraining automation.
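Optionally, you can also render the full Evidently report as HTML to inspect the drift per column:

# Optional: save the interactive Evidently report to inspect per-column drift.
data_drift_report.save_html('data_drift_report.html')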

To run the drift detection as a script, create a file called drift_detection.py and save it in the scripts folder. Fill the file with the code below.

import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

reference_data = pd.read_csv('data/reference_data.csv')
new_data = pd.read_csv('data/new_data.csv')

data_drift_report = Report(metrics=[
    DataDriftPreset()
])

data_drift_report.run(reference_data=reference_data.drop('Outcome', axis =1), 
                      current_data=new_data.drop('Outcome', axis =1), column_mapping=None)

report_json = data_drift_report.as_dict()
drift_detected = report_json['metrics'][0]['result']['dataset_drift']

if drift_detected:
    print("Data drift detected. Retraining the model.")
    with open('drift_detected.txt', 'w') as f:
        f.write('drift_detected')
else:
    print("No data drift detected.")
    with open('drift_detected.txt', 'w') as f:
        f.write('no_drift')

In the code above, we write the drift status to a drift_detected.txt file and print whether drift was detected. If drift is detected, we want to retrain the model, so we need to prepare a training script as well.

Create a file called train_model.py in the scripts folder and fill it with the following code.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import pickle

reference_data = pd.read_csv('data/reference_data.csv')
new_data = pd.read_csv('data/new_data.csv')

df= pd.concat([reference_data, new_data], ignore_index=True)

X = df.drop('Outcome', axis=1)
y = df['Outcome']

numeric_features = X.columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline.fit(X, y)

with open('models/pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

The code above combines the reference and drifted data into a new training set, which is then used to train a new model. This is a simplified approach; in the real world, the training data would need more preparation and the new model would need proper evaluation, as in the sketch below.
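As an illustration, here is a hedged sketch of a simple evaluation gate that could replace the unconditional fit-and-save at the end of train_model.py. It holds out part of the combined data and only overwrites the saved pipeline when the retrained model clears a minimum validation accuracy (the 0.70 threshold is purely illustrative):

# Sketch of an evaluation gate (illustrative threshold); assumes X, y, pipeline,
# and pickle from train_model.py above are already in scope.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_tr, y_tr)
val_accuracy = accuracy_score(y_val, pipeline.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")

if val_accuracy >= 0.70:            # hypothetical acceptance threshold
    pipeline.fit(X, y)              # refit on all data before saving
    with open('models/pipeline.pkl', 'wb') as f:
        pickle.dump(pipeline, f)
else:
    print("Retrained model below threshold; keeping the previous pipeline.")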

Nevertheless, with all the scripts ready, we can set up GitHub Actions to retrain the model when drift is detected. We need a YAML workflow file containing all the configuration required for the retraining.

So, let's create a file called mlops_pipeline.yml in the .github/workflows folder. Make sure the folder name is exactly right; GitHub Actions requires this path. Fill mlops_pipeline.yml with the code below.

name: Diabetes Retraining Pipeline with Data Drift Detection

on:
  push:
    paths:
      - 'data/new_data.csv'
permissions:
  contents: write
jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v2

    - name: Set up Python
      uses: actions/setup-python@v2 
      with:
        python-version: 3.9

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Run data drift detection
      run: |
        python scripts/drift_detection.py
      continue-on-error: true 

    - name: Check for data drift
      id: check_drift
      run: |
        if grep -q 'drift_detected' drift_detected.txt; then
          echo "Data drift detected."
          echo "drift=true" >> $GITHUB_ENV
        else
          echo "No data drift detected."
          echo "drift=false" >> $GITHUB_ENV
        fi
      shell: bash

    - name: Model Retraining if Data Drift detected
      if: env.drift == 'true'
      run: |
        python scripts/train_model.py

    - name: Commit and push updated model
      if: env.drift == 'true'
      env:
        GIT_COMMITTER_NAME: github-actions
        GIT_COMMITTER_EMAIL: [email protected]
      run: |
        git config --global user.name "github-actions"
        git config --global user.email "[email protected]"
        git remote set-url origin https://x-access-token:${{ secrets.ACTIONS_PAT }}@github.com/username/image_name.git
        git add models/pipeline.pkl
        git commit -m "Update model after retraining on $(date -u +'%Y-%m-%d %H:%M:%S UTC')"
        git push

    - name: Build Docker image
      if: env.drift == 'true'
      run: |
        docker build -t username/image_name -f Dockerfile .

    - name: Log in to Docker Hub
      if: env.drift == 'true'
      run: echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin

    - name: Push Docker image to Docker Hub
      if: env.drift == 'true'
      run: |
        docker push username/image_name:latest

    - name: Notify about the process
      run: |
        if [[ "${{ env.drift }}" == "false" ]]; then
          echo "No data drift detected. No retraining necessary."
        else
          echo "Data drift detected. Model retrained and deployed."
        fi
      shell: bash

The overall structure of the configuration in the YAML above is shown in the image below.

Image by Author

The workflow triggers whenever a new_data.csv file is pushed inside the data folder. However, the model retraining step only runs when drift is detected. If the model is retrained, the updated model is pushed back to the GitHub repository and the new image to Docker Hub.

Don't forget to change every username/image_name placeholder to your own values: your Docker Hub username and image name in the Docker steps, and your GitHub username and repository name in the git remote URL. You can also store these identifiers as repository secrets if they always stay the same.

Once all the files are ready, push them to your GitHub repository. Then, try creating new drifted data, saving it as new_data.csv, and pushing it to the repository once more, for example as shown below.
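To generate a fresh drifted file, you could reuse the introduce_drift helper from the notebook, for instance with a different drift_amount (an illustrative sketch; it assumes X_test, y_test, and features_to_drift are still defined):

# Generate another round of drifted data to trigger the workflow again.
new_drift = introduce_drift(X_test, features_to_drift, drift_amount=80)
new_drift = new_drift.reset_index(drop=True)
new_drift['Outcome'] = y_test.reset_index(drop=True)
new_drift.to_csv('..//data//new_data.csv', index=False)

Committing and pushing the updated new_data.csv will start the workflow.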

Go to the Actions tab in your GitHub repository. If the run is successful, you should see a job called build with a Success status.

Image by Author

Click on the job to see the details of the run. You can inspect each step to understand the process or to debug a failure.

Image by Author

If you go into the models folder in the repository, you can check whether the model has been updated. The commit message tells us when the model was retrained.

Image by Author

You could also check your Docker Hub repository to see if the image has been updated.

That's all you need to use GitHub Actions to simplify the model retraining process. You can tweak all the scripts to your needs, for example the trigger, the retraining condition, or the dataset.

If you need all the code used in this article, I have pushed it to this repository.


Conclusion

In this article, we learned how to use GitHub Actions to automate the model retraining process. By configuring a YAML workflow and choosing the right trigger, we can use GitHub Actions to streamline the whole retraining process.

Tags: Education Github Machine Learning Python Technology
