How to Tackle Class Imbalance Without Resampling

Imbalanced classification is a common machine learning task. This problem is usually handled with one of three approaches: resampling, cost-sensitive models, or threshold tuning.
In this article, you'll learn a different approach. We'll explore how to use clustering analysis to tackle imbalanced classification.
Introduction
Many real-world problems involve imbalanced data sets. In these, one of the classes is rare and, usually, more important to users.
Take fraud detection for example. Fraud cases are rare instances among vast amounts of regular activity. The accurate detection of rare but fraudulent activity is fundamental across many domains. Other common examples involving imbalanced data sets include customer churn or credit default prediction.
Imbalanced distributions are a challenge for machine learning algorithms. There's relatively little information about the minority class. This hinders the ability of algorithms to learn good models because they tend to be biased toward the majority class.
How to Deal with Class Imbalance
There are three standard approaches for dealing with class imbalance:
- Resampling methods;
- Cost-sensitive models;
- Threshold tuning.

Resampling is arguably the most popular strategy for handling imbalanced classification tasks. This type of method transforms the training set to increase the relative importance of the minority class.
Resampling can be used to create new cases for the minority class (over-sampling), discard cases from the majority class (under-sampling), or a combination of both.
Here's an example of how resampling methods work using the imblearn library:
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# a synthetic data set where the positive class is the rare one
X_train, y_train = make_classification(n_samples=500, n_features=5,
                                       n_informative=3, weights=[0.9, 0.1])

# over-sample the minority class with SMOTE
X_res, y_res = SMOTE().fit_resample(X_train, y_train)
Resampling methods are versatile and easy to couple with any learning algorithm. But they have some limitations.
Under-sampling the majority class may lead to important information loss. Over-sampling may increase the chance of overfitting. This occurs if resampling propagates noise from cases of the minority class.
There are some alternatives to resampling the training data. These include tuning the decision threshold or using cost-sensitive models. Different thresholds lead to distinct precision and recall scores. So, adjusting the decision threshold can improve a model's performance.
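For example, one can pick the threshold that maximizes F1 on a validation set. Here's a minimal sketch of this idea (the data set and model are illustrative placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# an illustrative imbalanced data set
X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, weights=[0.9, 0.1])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=42)

# fit any probabilistic classifier
model = RandomForestClassifier().fit(X_train, y_train)
probs = model.predict_proba(X_val)[:, 1]

# precision and recall at all candidate thresholds
precision, recall, thresholds = precision_recall_curve(y_val, probs)

# pick the threshold that maximizes F1 (other criteria are possible)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]
y_pred = (probs >= best_threshold).astype(int)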
Cost-sensitive models work by assigning different costs to misclassification errors. Errors in the minority class are typically more costly. This approach requires domain expertise to define the costs of each type of error.
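In scikit-learn, many algorithms support this approach through the class_weight parameter. Here's a minimal sketch; the 1:10 cost ratio is an arbitrary assumption for illustration:

from sklearn.ensemble import RandomForestClassifier

# misclassifying the minority class (label 1) is ten times more costly
model = RandomForestClassifier(class_weight={0: 1, 1: 10})

# alternatively, weigh classes inversely to their frequency
model_balanced = RandomForestClassifier(class_weight='balanced')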
Tackling Class Imbalance with Clustering
Most resampling methods work by finding instances close to the decision boundary – the frontier that separates the instances of the majority class from those of the minority class. Borderline cases are, in principle, the most difficult to classify. So, they are used to drive the resampling process.

For example, ADASYN is a popular over-sampling technique. It creates artificial instances using cases from the minority class whose nearest neighbors are from the majority class.
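Its usage with imblearn mirrors the SMOTE example above:

from imblearn.over_sampling import ADASYN

# over-sample the minority class, focusing on harder, borderline cases
X_res, y_res = ADASYN().fit_resample(X_train, y_train)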
Finding borderline cases with clustering analysis
We can also capture which observations are close to the decision boundary using clustering analysis.
Suppose there's a cluster whose observations all belong to the majority class. This might mean that this cluster is somewhat far from the decision boundary, on the side of the majority class. Generally, those observations are easy to model.
On the other hand, an instance can be considered borderline if it belongs to a cluster that contains both classes.
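Here's a minimal sketch of this idea. It uses k-means for simplicity, while the ICLL implementation below relies on hierarchical clustering; the number of clusters is an arbitrary choice:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, weights=[0.9, 0.1])

# cluster the explanatory variables
labels = KMeans(n_clusters=10, n_init=10).fit_predict(X)

# an instance is borderline if its cluster contains both classes
borderline = np.zeros(len(y), dtype=bool)
for lab in np.unique(labels):
    in_cluster = labels == lab
    if np.unique(y[in_cluster]).size > 1:
        borderline[in_cluster] = True

print(f'{borderline.sum()} of {len(y)} instances are borderline')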
We can use this information to build a hierarchical model for imbalanced classification.
How to build a hierarchical model for imbalanced classification
We build a hierarchical model based on two levels.
In the first level, a model is built to split easy instances from borderline ones. So, the goal is to predict if an input instance belongs to a cluster with at least one observation from the minority class.
In the second level, we discard the easy cases. Then, we build a model to solve the original classification task with the remaining data. The first level affects the second one by removing easy instances from the training set.
At both levels, the class imbalance is reduced, which makes the modeling task simpler.
Python implementation
The method described above is called ICLL (for Imbalanced Classification via Layered Learning). Here's its implementation:
from collections import Counter
from typing import List

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


class ICLL:
    """
    Imbalanced Classification via Layered Learning
    """

    def __init__(self, model_l1, model_l2):
        """
        :param model_l1: Predictive model for the first layer
        :param model_l2: Predictive model for the second layer
        """
        self.model_l1 = model_l1
        self.model_l2 = model_l2
        self.clusters = []
        self.mixed_arr = np.array([])

    def fit(self, X: pd.DataFrame, y: np.ndarray):
        """
        :param X: Explanatory variables
        :param y: Binary target variable
        """
        assert isinstance(X, pd.DataFrame)
        X = X.reset_index(drop=True)
        if isinstance(y, pd.Series):
            y = y.values

        self.clusters = self.clustering(X=X)
        # indices of instances in clusters with at least one minority case
        self.mixed_arr = self.cluster_to_layers(clusters=self.clusters, y=y)

        # first layer: easy cases (0) vs borderline or minority cases (1)
        y_l1 = y.copy()
        y_l1[self.mixed_arr] = 1

        # second layer: original task, restricted to the non-easy cases
        X_l2 = X.loc[self.mixed_arr, :]
        y_l2 = y[self.mixed_arr]

        self.model_l1.fit(X, y_l1)
        self.model_l2.fit(X_l2, y_l2)

    def predict(self, X):
        """
        Predicting new instances
        """
        yh_l1 = self.model_l1.predict(X)
        yh_l2 = self.model_l2.predict(X)

        # positive only if both layers predict the positive class
        yh_f = np.asarray(yh_l1) * np.asarray(yh_l2)

        return yh_f

    def predict_proba(self, X):
        """
        Probabilistic predictions
        """
        yh_l1_p = self.model_l1.predict_proba(X)
        try:
            yh_l1_p = np.array([x[1] for x in yh_l1_p])
        except IndexError:
            yh_l1_p = yh_l1_p.flatten()

        yh_l2_p = self.model_l2.predict_proba(X)
        yh_l2_p = np.array([x[1] for x in yh_l2_p])

        # combine the layers by multiplying their probabilities
        yh_fp = yh_l1_p * yh_l2_p

        return yh_fp

    @classmethod
    def cluster_to_layers(cls, clusters: List[np.ndarray], y: np.ndarray) -> np.ndarray:
        """
        Defining the layers from clusters
        """
        maj_cls, min_cls, both_cls = [], [], []
        for clst in clusters:
            y_clt = y[np.asarray(clst)]
            if len(Counter(y_clt)) == 1:
                # pure cluster: all majority or all minority
                if y_clt[0] == 0:
                    maj_cls.append(clst)
                else:
                    min_cls.append(clst)
            else:
                both_cls.append(clst)

        if len(both_cls) > 0:
            both_cls_ind = np.unique(np.concatenate(both_cls).ravel())
        else:
            both_cls_ind = np.array([])

        if len(min_cls) > 0:
            min_cls_ind = np.concatenate(min_cls).ravel()
        else:
            min_cls_ind = np.array([])

        # minority-only clusters are also passed to the second layer
        both_cls_ind = np.unique(np.concatenate([both_cls_ind, min_cls_ind])).astype(int)

        return both_cls_ind

    @classmethod
    def clustering(cls, X, method='ward'):
        """
        Hierarchical clustering analysis
        """
        d = pdist(X)
        Z = linkage(d, method)
        Z[:, 2] = np.log(1 + Z[:, 2])

        # cut the dendrogram one standard deviation above the mean height
        sZ = np.std(Z[:, 2])
        mZ = np.mean(Z[:, 2])
        clust_labs = fcluster(Z, mZ + sZ, criterion='distance')

        clusters = []
        for lab in np.unique(clust_labs):
            clusters.append(np.where(clust_labs == lab)[0])

        return clusters
The clustering part is done automatically, without user input. So, the only thing you need to define is the learning algorithm used at each level of the hierarchy.
And below is an example of how you can use the method. In this example, the model in each level is a Random Forest.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as RFC

# https://github.com/vcerqueira/blog/blob/main/src/icll.py
from src.icll import ICLL

# creating a dummy imbalanced data set
X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, weights=[0.9, 0.1])
X = pd.DataFrame(X)

# creating an instance of the model
icll = ICLL(model_l1=RFC(), model_l2=RFC())

# training
icll.fit(X, y)

# probabilistic predictions
probs = icll.predict_proba(X)
A more serious example
How does the hierarchical method compare with resampling?
Below is a comparison based on a data set related to diabetes. You can check reference [1] for details. Here's how we can apply both methods to this data:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from imblearn.over_sampling import SMOTE
# loading diabetes dataset https://github.com/vcerqueira/blog/tree/main/data
data = pd.read_csv('data/pima.csv')
X, y = data.drop('target', axis=1), data['target']
X = X.fillna(X.mean())
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# resampling with SMOTE
X_res, y_res = SMOTE().fit_resample(X_train, y_train)
# creating the models
smote = RFC()
icll = ICLL(model_l1=RFC(), model_l2=RFC())
# training
smote.fit(X_res, y_res)
icll.fit(X_train, y_train)
# inference
smote_probs = smote.predict_proba(X_test)
icll_probs = icll.predict_proba(X_test)
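The curves can be computed with the roc_curve function imported above. Here's a minimal sketch of the plot, assuming matplotlib is available:

import matplotlib.pyplot as plt

# the SMOTE-based model returns one column per class;
# ICLL returns the positive-class probability directly
fpr_s, tpr_s, _ = roc_curve(y_test, smote_probs[:, 1])
fpr_i, tpr_i, _ = roc_curve(y_test, icll_probs)

plt.plot(fpr_s, tpr_s, label='SMOTE')
plt.plot(fpr_i, tpr_i, label='ICLL')
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()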
Below is the ROC curve for each approach:

ICLL's curve is closer to the top-left corner, which indicates it is the better model.
Many more experiments were carried out in the paper where ICLL was presented [2]. The results suggest that ICLL provides competitive performance on imbalanced classification problems. You can check the code for the experiments on GitHub.
Key Takeaways
- Imbalanced classification is an important task in Data Science;
- Resampling the training set is a common approach to handling these problems, but it may lead to information loss or the propagation of noise. Common alternatives are threshold tuning and cost-sensitive models;
- You can also use hierarchical methods to handle the imbalance problem;
- ICLL is a hierarchical method for imbalanced classification. It doesn't require any user-defined parameters besides the learning algorithms used at each level. Its performance is competitive with resampling methods.
Hope you find this method useful. Thank you for reading, and see you in the next story!
References
[1] Pima Indians Diabetes data set (GPL-3 License)
[2] Cerqueira, V., Torgo, L., Branco, P., & Bellinger, C. (2022). Automated imbalanced classification via layered learning. Machine Learning, 1–22.
[3] Branco, P., Torgo, L., & Ribeiro, R. P. (2016). A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR), 49(2), 1–50.