Understanding Concept Drift: A Simple Guide


Detecting and adapting to concept drift is a key part of monitoring AI-based systems.

In this article, we'll:

  • Describe what concept drift is and how it arises in time-dependent data
  • Explore verification latency, and how it impacts change detection processes
  • Show a change detection example using scikit-multiflow

Introduction

Machine Learning models carry an implicit assumption of stationarity: they expect test or production samples to follow the same distribution as the training set.

And yet, this assumption is hardly ever met in real-world problems that exhibit a time-dependent structure.

The distribution of data in real-world environments tends to change over time. This change is called concept drift, and it happens across various application domains. In commerce, consumer interest in a service can change due to seasonal effects or emerging trends. In finance, shifts in the economy lead to changes in spending or credit conditions.

When change occurs, predictive systems need to detect it and adapt to it. Failing to do so can lead to drastic reductions in the accuracy and reliability of models.

An example of a shift in the mean of a variable. Image by author.

Addressing concept drift involves two steps:

  • Detecting changes in distribution
  • Adapting the model to the new concept

In this article, we'll explore the first step. Let's start by describing the types of changes that can occur.


Types of concept drift

Supervised learning involves building a model using input and output data. The output data is the target variable (Y) that you are trying to model, for example, whether a bank should grant a loan to a given person. The input data (X) are the explanatory variables that describe the person asking for that loan, such as the person's salary.

Changes in the distribution can be reflected in X, Y, or both. In general, these changes are categorized as follows:

  • Covariate shift (a.k.a. virtual drift): When the distribution of the input data changes, but not the distribution of the output. So, for a given input, you can expect the same outcome as before. Here's an example involving house price prediction. Most of the training set contains houses built in a particular architectural style, and during testing, other architectural designs become more common. Yet, despite the change in the input data, the relation between the covariates (specifically, architectural style) and price remains the same;
Visualization of concept drift and virtual drift. Image source here.
  • Label shift: This is the inverse case of covariate shift. The distribution of the output data changes, but not the distribution of the input data;
  • Concept drift: This happens when the relationship between the input and output data changes. So, for the same input as before, you can expect different behavior in the output. Following the example above, if there's a crash in house prices, the relationship between the covariates and price will change.

These types of change are not mutually exclusive. For example, concept drift may or may not occur with a covariate shift.

In any case, the underlying data distribution is different from the one used to train the model. So, it is likely that the optimal decision rule also changes and you should update your model.
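To make the distinction concrete, here's a minimal sketch with synthetic data. The single feature, the thresholds, and the shifted mean are illustrative assumptions, not taken from any real dataset:

import numpy as np

rng = np.random.default_rng(1)

# original concept: y depends on whether x exceeds 1.0
x_train = rng.normal(loc=0.0, scale=1.0, size=1000)
y_train = (x_train > 1.0).astype(int)

# covariate shift: the distribution of x changes (higher mean),
# but the rule that maps x to y stays the same
x_covariate_shift = rng.normal(loc=2.0, scale=1.0, size=1000)
y_covariate_shift = (x_covariate_shift > 1.0).astype(int)

# concept drift: the distribution of x is unchanged,
# but the rule that maps x to y is different (new threshold)
x_concept_drift = rng.normal(loc=0.0, scale=1.0, size=1000)
y_concept_drift = (x_concept_drift > -1.0).astype(int)

A model trained on the first pair would still make correct predictions under the covariate shift, but not under the concept drift.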

Is concept drift the same as anomaly detection?

Not quite.

Changes in distribution reflect a change in the process that generates the data, for example, a shift in consumer behavior. An anomalous sample, in contrast, is a single observation that deviates significantly from typical behavior, where typical behavior is defined by the current distribution. In short, concept drift is about the distribution itself changing over time, while anomaly detection is about flagging individual samples that don't fit it.


Verification Latency and Access to Labels


How do you detect concept drift?

Most approaches to change detection are based on tracking the performance of models. The loss (say, the error rate) is monitored over time, and a change is signalled if the loss increases significantly. This triggers some adaptation mechanism for updating the model, for example, a complete re-training process.
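As a rough illustration of this idea, here's a minimal sketch of a threshold-based monitor. The function name, window size, and threshold are arbitrary assumptions, not an established method; ADWIN, used later in this article, handles the windowing in a more principled way:

from collections import deque

import numpy as np

def detect_error_increase(errors, window=100, threshold=0.1):
    """Flag a change when the recent error rate exceeds the
    error rate of an initial reference window by `threshold`."""
    reference = np.mean(errors[:window])   # error rate on the reference period
    recent = deque(maxlen=window)          # sliding window of recent errors

    for i, err in enumerate(errors[window:], start=window):
        recent.append(err)
        if len(recent) == window and np.mean(recent) - reference > threshold:
            return i                       # index at which change is flagged
    return None                            # no change detected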

To compute the loss, you need to know the true labels of the samples you're predicting. But this can be too optimistic in real-time systems: in some cases, labels rarely (if ever) become available after a prediction is served.

The period between the processing of a sample and the arrival of its true label is called verification latency. Verification latency depends on the application domain, but it's rare to encounter a real-world scenario where it is negligible. In some cases, the verification delay can reach several weeks.

Illustrating the verification latency process. Image by author

In some domains, it is costly to annotate all records. More often than not, annotation requires a human expert in the domain. Or the system processes such large volumes of data that complete annotation is infeasible.

So, you might never get the labels for some records. These samples are said to have an infinite verification delay.

The diagram below summarises different label availability scenarios.

So, performance-based change detection may not be suitable if it's difficult to get labels.

Suppose there's a significant verification delay and you have to wait for the labels of most records. By the time you detect a change, many poor predictions have already been made, which can significantly hurt the business.

All in all, getting the true labels is a challenge in real-time systems.
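To see how this delay affects monitoring, here's a small sketch that simulates it: predictions are buffered, and their errors only become available to the detector after a fixed delay. The 50-step delay and the two helper functions are made up for illustration:

from collections import deque

DELAY = 50            # assumed verification latency, in number of samples
pending = deque()     # predictions waiting for their true label to "arrive"
errors = []           # errors that can actually be computed, with a delay

def on_prediction(step, prediction, true_label):
    # the true label exists, but only becomes visible DELAY steps later
    pending.append((step, prediction, true_label))

def on_step(step):
    # release every label whose verification delay has elapsed
    while pending and pending[0][0] <= step - DELAY:
        _, prediction, true_label = pending.popleft()
        errors.append(int(prediction != true_label))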


Change detection with scikit-multiflow

Let's see an example of how to detect changes. Here, we'll focus on tracking the error of a classifier. The scikit-multiflow library has several solutions for this problem.

Let's start by creating an example data stream.

from skmultiflow.data.sea_generator import SEAGenerator

# SEAGenerator simulates a data stream for binary classification
stream = SEAGenerator(classification_function=2,
                      random_state=112,
                      balance_classes=False,
                      noise_percentage=0.28)

# drawing an initial batch of 1000 samples for training
X, y = stream.next_sample(1000)

First, we get 1000 samples from the data stream. Here's a sample of this dataset:

Sample of the data stream. Image by author.

Then, we train an Adaptive Random Forest on this initial batch. The scikit-multiflow API is similar to scikit-learn's, with models trained incrementally via partial_fit.

from skmultiflow.meta import AdaptiveRandomForestClassifier

model = AdaptiveRandomForestClassifier()

# incremental training on the initial batch
model.partial_fit(X, y)

After training the model, we are ready to monitor its performance. ADWIN (short for ADaptive WINdowing) is a popular approach for this. The idea is to feed the model's error to the detector as the data stream is processed: ADWIN keeps an adaptive window over these errors and triggers an alarm when their average increases significantly.

Here's how to do this:

from skmultiflow.drift_detection.adwin import ADWIN

# creating an instance of ADWIN
change_detector = ADWIN()

# SEAGenerator is an infinite stream, so we process a fixed number of samples
for i in range(5000):
    # getting a new sample
    X_i, y_i = stream.next_sample()

    # making the prediction
    pred = model.predict(X_i)

    # computing the error for the sample (1 if the prediction is wrong, 0 otherwise)
    err = int(pred[0] != y_i[0])

    # feeding the error to ADWIN
    change_detector.add_element(err)

    # checking if a change was detected
    if change_detector.detected_change():
        print(f'Change detected at sample {i}')

If the change detector triggers an alarm, you should update your model, as sketched below.
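One possible way to do that, continuing the example above, is to refit the model on a recent window of samples once the alarm fires. The sketch below is only one option; the window size of 500 and the choice to start a fresh model are assumptions, not a recommendation:

import numpy as np
from collections import deque

# sliding window with the most recent samples (window size is arbitrary)
recent_X, recent_y = deque(maxlen=500), deque(maxlen=500)

# inside the monitoring loop, after predicting on X_i, y_i:
recent_X.append(X_i[0])
recent_y.append(y_i[0])

if change_detector.detected_change():
    # discard the old model and fit a fresh one on recent data only
    model = AdaptiveRandomForestClassifier()
    model.partial_fit(np.array(recent_X), np.array(recent_y))
    change_detector.reset()

Other strategies, such as updating the existing model incrementally, may be preferable when changes are gradual.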


Takeaways

Change detection and adaptation are important for keeping models up to date. Changes can happen in different ways: in the input data, the target variable, or both.

When choosing a detection mechanism, you should take into account the verification latency: how long it takes to get the labels after making a prediction.

If tracking the model's performance is viable, scikit-multiflow has several methods for doing so.

Thank you for reading and see you in the next story!

References

[1] scikit-multiflow documentation for ADWIN

[2] Gama, João, et al. "A survey on concept drift adaptation." ACM computing surveys (CSUR) 46.4 (2014): 1–37.

