Monitoring Machine Learning Models in Production: Why and How?


Machine Learning (ML) model development takes time and requires technical expertise. As data science enthusiasts, when we acquire a dataset to explore and analyze, we eagerly train and validate models on it, trying diverse state-of-the-art architectures and data-centric strategies. Once the model's performance is optimized, it feels as if all the work has been accomplished.

However, after deploying the model to production, there are plenty of reasons why its performance can degrade.


#1 The training data is generated through simulation

Data scientists often face limitations in accessing production data, which results in training the model on simulated or sampled data instead. While data engineers bear the responsibility of ensuring the training data is representative in terms of scale and complexity, it still deviates to some extent from the production data. There is also a risk of systematic flaws in upstream data processing, such as data collection and labeling. These factors can hinder the extraction of useful input features or the model's ability to generalize well.

Example: Investor data in the financial industry or patient information in the healthcare industry is often simulated due to security and privacy concerns.

#2 The new production data exhibits a new data distribution

Over time, the characteristics of input features can change, such as shifts in age groups, income ranges, or other customer demographics. The data source itself may even be replaced entirely for various reasons. During model development, optimization relies on learning and capturing patterns from the majority group within the training data. As time progresses, however, the previous majority may become a minority in the production data, rendering the original static model inadequate for the most recent production needs.

Example: The model was initially trained on customer data from Asian regions. As the business recently expanded into the United States, the same model now makes predictions on input features whose distributions have shifted.

Data drift (Image by author)

#3 The patterns that we predict are evolving

Apart from shifts in the distribution of input features, the relationship between the features and the target variable can also change in the evolving environment. These changes can occur in unexpected ways over time, rendering the original model progressively ineffective.

  • Sudden concept change

These changes can occur abruptly, sometimes within a few weeks, due to unforeseen circumstances.

Example: The surge in demand for virtual meeting services during COVID-19 lockdowns.

  • Gradual concept change

This type of change takes longer to manifest and is often a natural progression.

Example: The gradual increase in the price of dairy products due to long-term inflation.

  • Recurrent concept change

These changes can happen periodically, often during specific times of the year.

Example: The rapid growth in e-commerce sales during various special days like Black Friday and the Saturday before Christmas.

Concept drift (Image by author)

Model development is only a tiny part of a production-ready ML system

Many companies emphasize data-driven decision-making through ML applications. Imagine your ML models being used in critical applications, such as medical diagnosis, to support healthcare professionals in identifying diseases and conditions. Any model degradation affects the accuracy of diagnoses, potentially resulting in incorrect treatment plans and compromised patient outcomes. With so many high-stakes use cases in the real world, it becomes imperative to implement objective and continuous monitoring to detect any possible shifts.

In the following sections, we will delve into various layers of monitoring and provide illustrative Python code examples to demonstrate their implementation.


1 Monitor the model performance metrics

To detect any sign of model degradation, one of the most direct and effective ways is to keep track of the performance metrics over time.

These performance metrics collected from the initial deployment serve as benchmarks for ongoing monitoring and evaluation. Periodically reassessing them is crucial whenever a new batch of ground truth data is collected, such as upon completion of a marketing campaign. If the error metrics rise above a predefined threshold, or if the metrics such as R-squared fall below the threshold, it is necessary to consider re-executing the data engineering process and retraining the model.
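As a minimal sketch of such a threshold check (the metric choice, threshold value, and alerting mechanism here are illustrative assumptions, not a prescribed setup):

from sklearn.metrics import f1_score

# Illustrative alert threshold agreed upon at deployment time (assumption)
F1_ALERT_THRESHOLD = 0.75

def evaluate_new_batch(y_true, y_pred, threshold=F1_ALERT_THRESHOLD):
    # Recompute the benchmark metric on the newly collected ground truth
    current_f1 = f1_score(y_true, y_pred, average='weighted')
    if current_f1 < threshold:
        # In practice this could notify the team or trigger a retraining pipeline
        print('ALERT: F1 score dropped to %.3f (threshold %.2f)' % (current_f1, threshold))
    return current_f1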

While this monitoring approach provides valuable insights into drifts, it is inherently a lagging indicator, since ground truth labels arrive only after predictions have already been served. We can take a more proactive approach and monitor the latest input data instead.

2 Detect the changes in data distribution

Instead of waiting for sufficient ground truth to reliably evaluate model performance, we can apply statistical methods to compare the data distributions of two datasets. In our case, we test whether the distribution of the training dataset matches that of the latest production dataset. If the comparison shows, with statistical confidence, that the two distributions differ, it suggests a drift, which serves as a proxy for performance changes.

  • Kolmogorov-Smirnov Test (K-S Test): A nonparametric test (i.e., one that makes no assumption about the underlying data distribution) for numeric features. It is more sensitive near the center of the distribution than at the tails.

Interpretation: When p-value < 0.05, an alert indicating the presence of a drift is triggered.

  • Population Stability Index (PSI): A metric, applicable to both numeric and categorical variables, that shows how much each variable has diverged independently from its baseline distribution. It is computed by bucketing both datasets and summing (%base − %new) × ln(%base / %new) over the buckets. It is sometimes called the Characteristic Stability Index (CSI) when evaluating the distribution of features rather than the target variable.

Interpretation: A value between 0 and 0.1 means no significant distribution change; a value between 0.1 and 0.2 indicates a moderate distribution change; and a value larger than 0.2 is interpreted as a significant distribution change.

Other well-known measures include Kullback-Leibler divergence, Jensen-Shannon divergence, and Wasserstein distance.
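As a minimal sketch of how these measures could be computed with SciPy (the synthetic samples and bin count below are illustrative assumptions):

import numpy as np
from scipy.stats import entropy, wasserstein_distance
from scipy.spatial.distance import jensenshannon

# Illustrative samples standing in for the baseline (training) and latest production data
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)
current = rng.normal(loc=0.3, scale=1.0, size=5000)

# Wasserstein distance operates directly on the raw samples
print('Wasserstein distance: %.3f' % wasserstein_distance(baseline, current))

# KL and Jensen-Shannon compare binned probability distributions
bins = np.histogram_bin_edges(np.concatenate([baseline, current]), bins=20)
p, _ = np.histogram(baseline, bins=bins)
q, _ = np.histogram(current, bins=bins)
p, q = p + 1e-6, q + 1e-6  # avoid zero counts in empty bins
print('KL divergence: %.3f' % entropy(p, q))                  # SciPy normalizes the counts internally
print('Jensen-Shannon distance: %.3f' % jensenshannon(p, q))  # square root of the JS divergence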

3 Monitor drift using a sliding window approach

Any delay in detecting data or pattern drifts results in a time gap, potentially leading to discrepancies between the ground truth and model predictions. To bridge the gap, are there any further advancements available? One promising idea is leveraging streaming data instead of batch data.

The Adaptive Windowing (ADWIN) algorithm uses a sliding window approach to detect concept drift effectively. Unlike traditional fixed-size windows, ADWIN adapts the window size by cutting the statistics window at different points. As new data arrives, ADWIN analyses the statistics and identifies the point at which two sub-windows exhibit a notable difference in their means.

Interpretation: When the absolute difference between the two means exceeds a pre-defined threshold, an alert indicating the presence of a drift is triggered.


Example Walkthrough

Let's explore an example that demonstrates the implementation of the above monitoring strategies.

We will utilize a dataset obtained from Kaggle. The dataset comprises 100k records, encompassing 28 features that describe customer demographics and their credit-related history.

Our objective is to segment the customers of a global financial company into credit score brackets. The target variable is Credit_Score, a categorical measure with the classes 'Poor', 'Standard', and 'Good'.

Examples of features include:

  • Occupation: Occupation of the customers (e.g. scientist, teacher, engineer, etc.)
  • Annual_Income: Annual income of the customers
  • Credit_History_Age: The age of the customers' credit history
  • Payment_Behaviour: 6 groups of payment behaviors based on spending frequency (low/high) and payment amount (small/medium/large)

Model performance metrics

Let's start by applying data cleansing and transformation techniques, including but not limited to:

  • Correcting/ Removing records with missing values or incorrect data (e.g. age < 0), or duplicates
  • Detecting and handling data outliers
  • Performing min-max scaling on numeric variables
  • Applying label encoding on categorical variables

This part requires in-depth exploratory data analysis. However, since the focus is on sharing the monitoring strategies, the transformations performed are not discussed in detail here.
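For illustration only, a rough sketch of a few of these steps might look like the following (the column names and cleaning rules are assumptions based on the Kaggle dataset, not the exact transformations used):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Remove duplicates and records with implausible values (assumed rule for illustration)
df = df.drop_duplicates()
df = df[pd.to_numeric(df['Age'], errors='coerce').between(0, 120)]

# Min-max scaling on numeric variables (assumed column names)
numeric_cols = ['Annual_Income', 'Credit_History_Age']
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

# Label encoding on categorical variables (assumed column names)
for col in ['Occupation', 'Payment_Behaviour']:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))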

Afterward, we split the dataset and trained a model with the gradient boosting algorithm LightGBM.

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from lightgbm import LGBMClassifier

# Convert target variable to numeric
df['Credit_Score'] = df['Credit_Score'].map({'Poor': 1, 'Standard': 2, 'Good': 3})

# Split the dataset
X=df.loc[:, df.columns != 'Credit_Score']
Y=df['Credit_Score']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

# Train the LightGBM model
lgbm = LGBMClassifier()
lgbm.fit(x_train, y_train)
y_pred = lgbm.predict(x_test)

# Print performance metrics
print('F1 score: %.3f' % f1_score(y_test, y_pred, average='weighted'))
print('Precision: %.3f' % precision_score(y_test, y_pred, average='weighted'))
print('Recall: %.3f' % recall_score(y_test, y_pred, average='weighted'))

During our model development, we gathered several performance metrics, including an F1 score of 0.810, a precision score of 0.818, and a recall score of 0.807. Subsequent monitoring can be evaluated similarly. For instance, if the F1 score falls below 0.75, it serves as an alert, prompting us to take immediate action to mitigate the issue.

K-S Test

To demonstrate how the K-S Test works, I have selected the numeric feature Credit_History_Age. We will examine the sensitivity of this statistical test by creating three distinct datasets: one with 1000 samples, another with 5000 samples, and a third with approximately 64000 samples, which corresponds to the total size of the cleaned training set. Each data point within the Credit_History_Age feature has been randomly picked from the training data and shifted by a random floating-point offset.

Data distributions of ‘Credit_History_Age' in original data and drifted data (Image by author)

We obtained the following results:

  • For the dataset with 1000 samples, the p-value of the K-S Test is 0.093.
  • For the dataset with 5000 samples, the p-value of the K-S Test is 0.002.
  • For the dataset with around 64000 samples, the p-value of the K-S Test is 0.000.

When examining the data drift scenario with 1000 samples, the p-value exceeded 0.05, so the test gives no evidence that the two distributions differ. However, once the sample size increased to 5000 and beyond, the K-S Test performed very well, yielding p-values far below 0.05 and providing a clear alert of a data distribution drift.

from scipy import stats

# Create new datasets with different no. of samples
original_df = x_train[['Credit_History_Age', 'Payment_Behaviour']].reset_index(drop=True)
new_df = x_train[['Credit_History_Age', 'Payment_Behaviour']].reset_index(drop=True)
new_df1 = new_df.sample(n = 1000).reset_index(drop=True)
new_df2 = new_df.sample(n = 5000).reset_index(drop=True)
new_df3 = new_df.sample(n = len(x_train)).reset_index(drop=True)

# Prepare drifted data for numeric feature
def drift_numeric_col(df, numeric_col, drift_range):
    df[numeric_col] = df[numeric_col] + np.random.uniform(0, drift_range, size=(df.shape[0], ))

drift_numeric_col(new_df1, 'Credit_History_Age', 2)
drift_numeric_col(new_df2, 'Credit_History_Age', 2)
drift_numeric_col(new_df3, 'Credit_History_Age', 2)

# K-S Test
def ks_test(original_df, new_df, numeric_col):
    test = stats.ks_2samp(original_df[numeric_col], new_df[numeric_col])
    print("Column : %s , p-value : %1.3f" % (numeric_col, test[1]))

# Conduct K-S Test for numeric feature
ks_test(original_df, new_df1, 'Credit_History_Age')
ks_test(original_df, new_df2, 'Credit_History_Age')
ks_test(original_df, new_df3, 'Credit_History_Age')

PSI

In addition to the K-S Test, we will also leverage the Population Stability Index (PSI) to evaluate the numeric feature Credit_History_Age and assess the categorical feature Payment_Behaviour. To simulate the drift effect, we have randomly replaced 80% of the values of Payment_Behaviour with randomly chosen label values.

import random

# Prepare drifted data for categorical column by overwriting a share of its values
def drift_cat_col(df, cat_col, drift_ratio):
    no_of_drift = round(len(df)*drift_ratio)
    random_numbers = [random.randint(0, 1) for _ in range(no_of_drift)]
    indices = random.sample(range(len(df[cat_col])), no_of_drift)
    df.loc[indices, cat_col] = random_numbers

drift_cat_col(new_df1, 'Payment_Behaviour', 0.8)
drift_cat_col(new_df2, 'Payment_Behaviour', 0.8)
drift_cat_col(new_df3, 'Payment_Behaviour', 0.8)

Data distributions of ‘Payment_Behaviour' in original data and drifted data (Image by author)

  • Numeric feature Credit_History_Age

With a sample size of 1000, the PSI value is 0.023.

With a sample size of 5000, the PSI value is 0.015.

With a sample size of approximately 64000, the PSI value is 0.021.

  • Categorical feature Payment_Behaviour

With a sample size of 1000, the PSI value is 0.108.

With a sample size of 5000, the PSI value is 0.111.

With a sample size of approximately 64000, the PSI value is 0.112.

All the PSI values for the Credit_History_Age feature are well below 0.1, indicating no significant distribution change. Comparing these results with those of the K-S Test, we observe that the K-S Test is more sensitive in detecting distribution changes than PSI.

On the other hand, the PSI values for the Payment_Behaviour feature are around 0.11, signifying a moderate distribution change. Interestingly, the three PSI values remain relatively consistent, implying that PSI's effectiveness is less dependent on sample size. Besides, PSI has the flexibility to monitor various feature types, which keeps it a valuable approach for drift detection.

Below is the implementation code of PSI:

def psi(data_base, data_new, num_bins = 10):
    # Sort the data
    data_base = sorted(data_base)
    data_new = sorted(data_new)

    # Prepare the bins
    min_val = min(data_base[0], data_new[0])
    max_val = max(data_base[-1], data_new[-1])
    bins = [min_val + (max_val - min_val)*(i)/num_bins for i in range(num_bins+1)]
    bins[0] = min_val - 0.0001
    bins[-1] = max_val + 0.0001

    # Bucketize the baseline data and count the samples
    bins_base = pd.cut(data_base, bins = bins, labels = range(1,num_bins+1))
    df_base = pd.DataFrame({'base': data_base, 'bin': bins_base})
    grp_base = df_base.groupby('bin').count()
    grp_base['percent_base'] = grp_base['base'] / grp_base['base'].sum() 

    # Bucketize the new data and count the samples
    bins_new = pd.cut(data_new, bins = bins, labels = range(1,num_bins+1))
    df_new = pd.DataFrame({'new': data_new, 'bin': bins_new})
    grp_new = df_new.groupby('bin').count()
    grp_new['percent_new'] = grp_new['new'] / grp_new['new'].sum()

    # Compare the bins
    psi_df = grp_base.join(grp_new, on = "bin", how = "inner")

    # Calculate the PSI
    psi_df['percent_base'] = psi_df['percent_base'].replace(0, 0.0001)
    psi_df['percent_new'] = psi_df['percent_new'].replace(0, 0.0001)
    psi_df['psi'] = (psi_df['percent_base'] - psi_df['percent_new']) * np.log(psi_df['percent_base'] / psi_df['percent_new'])

    # Return the total PSI value
    return np.sum(psi_df['psi'].values)

# Compute PSI for numeric feature
psi(original_df['Credit_History_Age'], new_df1['Credit_History_Age'])
psi(original_df['Credit_History_Age'], new_df2['Credit_History_Age'])
psi(original_df['Credit_History_Age'], new_df3['Credit_History_Age'])

# Compute PSI for categorical feature
psi(original_df['Payment_Behaviour'], new_df1['Payment_Behaviour'])
psi(original_df['Payment_Behaviour'], new_df2['Payment_Behaviour'])
psi(original_df['Payment_Behaviour'], new_df3['Payment_Behaviour'])

ADWIN algorithm

Lastly, we test the power of ADWIN in detecting the change in the numeric feature Credit_History_Age. The data stream consists of the training data, followed by the drifted data. We expect the algorithm to identify the drift shortly after it has processed all of the original data.

To visually represent the situation, a scatter plot is created to showcase the final 500 points of the original data in blue, followed by the initial 500 points of the drifted data, depicted in green. The drifted data exhibits a slightly higher average value.

Scatter plot (Image by author)

By continuously adding stream elements, ADWIN identifies the change at index 64457, which is the 637th data point within the drifted data. In comparison, the K-S Test and PSI need a larger number of data points to confidently conclude the presence of a drift. This responsiveness demonstrates ADWIN's capability to monitor diverse features with speed and ease.

Below is the implementation code of ADWIN:

from skmultiflow.drift_detection import ADWIN

adwin = ADWIN()

# Build the data stream: the original data followed by the drifted data
data_stream = np.concatenate((original_df['Credit_History_Age'], new_df3['Credit_History_Age']))

# Add stream elements to ADWIN and verify if drift occurred
for i in range(len(data_stream)):
    adwin.add_element(data_stream[i])
    if adwin.detected_change():
        print('Change detected at index {}'.format(i))
        adwin.reset()

Wrapping it up

We've delved into the crucial concepts of data drift and concept drift, which can lead to model decay in production. We can proactively monitor and detect drift using model performance metrics, statistical tests, and adaptive windowing techniques. Unlike a one-off Kaggle competition, building a production-ready ML system is an iterative journey. It requires a mindset shift towards integrating comprehensive model monitoring to ensure a robust and consistently high-performing service.


Before you go

If you enjoyed this reading, I invite you to follow my Medium page. By doing so, you can stay updated with exciting content related to data science side projects, Machine Learning Operations (MLOps) demonstrations, and project management methodologies.

Optimizing Your Strategies with Approaches Beyond A/B Testing

How to Apply Data-centric AI Mindset to Text Classification Problems?

