Can Synthetic Data Boost Machine Learning Performance?

Background – Imbalanced Datasets
Imbalanced classification problems frequently occur in commercial machine learning use cases. You may encounter them in churn prediction, fraud detection, medical diagnosis, or spam detection. In all these scenarios, what we aim to detect belongs to the minority class, which can be highly underrepresented in our data. There are several approaches proposed for enhancing the performance of models on imbalanced datasets:
- Undersampling: Achieve a more balanced training dataset by randomly undersampling the majority class.
- Oversampling: Obtain a balanced training dataset by randomly oversampling the minority class.
- Weighted Losses: Weight the loss function so that errors on the minority class are penalised more heavily.
- Synthetic Data: Use generative AI to create high-fidelity synthetic data samples of the minority class.
In this article I demonstrate how training a model on synthetic data surpasses the other approaches in enhancing the performance of the classifier.
The Dataset
The data is sourced from Kaggle, consisting of 284,807 credit card transactions, 492 (0.172%) of which are labelled as fraudulent. The data is available for both commercial and non-commercial usage under the Open Data Commons license.
For interested readers, Kaggle offers more detailed information and basic descriptive statistics about the data.
From this Kaggle dataset, I create two subsets: a training set and a holdout set. The training set comprises 80% of the total data, along with synthetically generated samples when exploring that approach. The holdout set constitutes 20% of the original data, excluding any synthetic samples.
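For reference, a minimal sketch of that split using scikit-learn is shown below. The file name, random seed, and the use of stratification are my assumptions, not details from the article; stratifying on the label keeps the fraud rate consistent across both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")  # Kaggle credit card fraud dataset (assumed file name)

# 80/20 split, stratified on the Class label so both sets keep the ~0.172% fraud rate
train_df, holdout_df = train_test_split(
    df, test_size=0.2, stratify=df["Class"], random_state=42
)
```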

The Model
I use Ludwig, an open-source, declarative framework for building deep learning models, due to its ease of implementation. Models are built and trained by declaring them in a YAML file and running a training job through Ludwig's Python API. I have previously written an article detailing Ludwig for those who are interested.
For each approach, I use the same baseline model, only adjusting specific parameters as necessary. For instance, Ludwig supports weight and sampling adjustments natively; these are simply set in the YAML file. I have provided links to the model configuration YAML files for each approach for your exploration.
- Baseline Model – link
- Weighted losses model – link
- Undersampling model – link
- Oversampling model – link
- Synthetic data – Utilises the same model as the baseline, since the classes are already balanced.
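For orientation, here is a minimal sketch of what training the baseline might look like through Ludwig's Python API. The config below is illustrative rather than the author's exact configuration (the linked YAML files hold the real settings); the column names follow the Kaggle dataset (`Time`, `V1`–`V28`, `Amount`, `Class`) and the hyperparameters are placeholders.

```python
from ludwig.api import LudwigModel

# Illustrative config: numeric input features and a binary output feature.
# Hyperparameters are placeholders; see the linked YAML files for the actual settings.
feature_cols = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]
config = {
    "input_features": [{"name": c, "type": "number"} for c in feature_cols],
    "output_features": [{"name": "Class", "type": "binary"}],
    "trainer": {"epochs": 20},
}

model = LudwigModel(config)
train_stats, _, _ = model.train(dataset=train_df)   # train_df from the 80% split
predictions, _ = model.predict(dataset=holdout_df)  # score the 20% holdout
```

The same `LudwigModel` call accepts either a config dict, as above, or a path to a YAML file, which is how the linked configurations are used.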
Generating Synthetic Data
I utilise Synthetic Data Vault (SDV), an open-source library for generating synthetic data samples. With SDV, I generate an additional 284k synthetic fraud samples, thereby achieving equal representation of both classes in the training dataset.
The synthetic samples are generated with variational autoencoders adapted for tabular data (TVAE). You can find more details on the theory behind TVAEs in this paper.
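A minimal sketch of generating the samples with SDV's `TVAESynthesizer` is shown below, fitting the synthesizer only on the fraud rows of the training split. The variable names are mine and the sample size simply follows the figure quoted in the article.

```python
from sdv.metadata import SingleTableMetadata
from sdv.single_table import TVAESynthesizer

fraud_df = train_df[train_df["Class"] == 1]  # real fraud rows from the training split

# Describe the table schema, then fit the tabular VAE on the minority class only
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(fraud_df)
synthesizer = TVAESynthesizer(metadata)
synthesizer.fit(fraud_df)

# Draw enough synthetic fraud rows to balance the training classes (~284k in this article)
synthetic_fraud = synthesizer.sample(num_rows=284_000)
```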
SDV offers diagnostic statistics, giving an indication of fit quality. You can manually explore the fit quality by comparing variable distributions in the real data versus the generated data as shown in the examples below.
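A sketch of pulling those diagnostics, and of comparing a single column's real versus synthetic distribution, using SDV's evaluation helpers; the choice of the `Amount` column is just an example, and the variables reuse the names from the generation sketch above.

```python
from sdv.evaluation.single_table import evaluate_quality, get_column_plot, run_diagnostic

# Structural checks (valid ranges, schema adherence) and an overall quality score
diagnostic = run_diagnostic(real_data=fraud_df, synthetic_data=synthetic_fraud, metadata=metadata)
quality = evaluate_quality(real_data=fraud_df, synthetic_data=synthetic_fraud, metadata=metadata)

# Real vs. synthetic distribution for a single column, e.g. the transaction Amount
fig = get_column_plot(
    real_data=fraud_df, synthetic_data=synthetic_fraud, column_name="Amount", metadata=metadata
)
fig.show()
```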



Assessing Performance with Precision-Recall Charts
We assess the performance of each model by plotting its precision-recall curve on the holdout dataset.
Precision-Recall Curve
The Precision-Recall curve, a plot of Precision (on the y-axis) against Recall (on the x-axis) for varying classification thresholds, is akin to the [ROC curve](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc). It is a robust diagnostic tool for evaluating model performance under significant class imbalance, which makes it well suited to our credit card fraud detection use case.
The top-right corner of the plot represents the "ideal" point: a precision of one and a recall of one. A skilful model approaches this point, so a larger area under the curve (AUC-PR) generally indicates a better model.
No Skill Predictor
A "no skill" predictor is a naïve model that makes predictions randomly. For imbalanced datasets, the no skill line is a horizontal line at a height equivalent to the positive class proportion. This is because if the model randomly predicts the positive class, precision would be equivalent to the positive instances proportion in the dataset.
Model Performance – Baseline
The baseline model is the deep neural network with no sampling adjustments, loss function adjustments, or augmented training data. Each approach is compared to the baseline, which serves as the performance benchmark.

Model Performance – Weighted Losses Approach
Weighted loss adjusts the loss function based on the ratio of fraudulent to non-fraudulent transactions.
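As a rough illustration of the adjustment, and assuming Ludwig's `binary_weighted_cross_entropy` loss with its `positive_class_weight` parameter, the change to the baseline config might look like the sketch below; the weight value is illustrative, set near the non-fraud to fraud ratio.

```python
# Hypothetical variant of the baseline config sketched earlier: weight errors on the
# positive (fraud) class more heavily in the binary cross-entropy loss.
weighted_config = dict(config)
weighted_config["output_features"] = [{
    "name": "Class",
    "type": "binary",
    "loss": {"type": "binary_weighted_cross_entropy", "positive_class_weight": 578},
}]
```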

Model Performance – Oversampling Approach
Oversampling randomly oversamples the fraudulent transactions until there is equal representation across the classes in the training dataset.
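A sketch of how this might be expressed, assuming Ludwig's `oversample_minority` preprocessing parameter; the target value and its exact semantics should be checked against the Ludwig documentation, and 1.0 is only intended here to indicate aiming for balanced classes.

```python
# Hypothetical variant of the baseline config: oversample the minority (fraud) class
# during Ludwig's preprocessing step. Parameter name and ratio semantics assumed.
oversample_config = dict(config)
oversample_config["preprocessing"] = {"oversample_minority": 1.0}  # aim for ~50/50
```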

Model Performance – Undersampling Approach
Undersampling randomly undersamples the non-fraudulent transactions until there is equal representation across classes in the training dataset.
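The mirror-image adjustment, assuming Ludwig's `undersample_majority` preprocessing parameter; again the value and its semantics are assumptions to be checked against the documentation.

```python
# Hypothetical variant of the baseline config: randomly drop majority (non-fraud) rows
# during preprocessing. Parameter name and ratio semantics assumed.
undersample_config = dict(config)
undersample_config["preprocessing"] = {"undersample_majority": 1.0}  # aim for ~50/50
```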

Model Performance – Synthetic Data Approach
This approach leverages the TVAE to produce 284k synthetic fraudulent samples, achieving equal representation across classes in the training dataset.
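Concretely, the synthetic rows are simply appended to the real training split before training the same baseline model configuration; a sketch, reusing the `train_df` and `synthetic_fraud` names from the earlier snippets:

```python
import pandas as pd

# Augment the real training split with the TVAE-generated fraud rows;
# the holdout set remains untouched real data.
augmented_train_df = pd.concat([train_df, synthetic_fraud], ignore_index=True)
```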

Bootstrapping Holdout Dataset
To obtain a robust view of performance on the holdout set, I created fifty bootstrapped holdout sets from the original. Running the models associated with each approach across all sets provides a distribution of performance. We can then determine whether each approach is statistically significantly different from the baseline using the Kolmogorov-Smirnov test.
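A minimal sketch of this evaluation for one metric (AUC-PR), assuming `y_holdout` labels and per-model fraud probabilities (`baseline_scores`, `approach_scores`) are available; the variable names and the choice of seed are mine.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import auc, precision_recall_curve

rng = np.random.default_rng(0)

def bootstrap_auc_pr(y_true, y_scores, n_boot=50):
    """AUC-PR computed over bootstrap resamples of the holdout set."""
    y_true, y_scores = np.asarray(y_true), np.asarray(y_scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        p, r, _ = precision_recall_curve(y_true[idx], y_scores[idx])
        aucs.append(auc(r, p))
    return np.array(aucs)

# Distribution of AUC-PR for the baseline vs. a candidate approach,
# compared with a two-sample Kolmogorov-Smirnov test
baseline_aucs = bootstrap_auc_pr(y_holdout, baseline_scores)
approach_aucs = bootstrap_auc_pr(y_holdout, approach_scores)
ks_stat, p_value = ks_2samp(baseline_aucs, approach_aucs)
```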
Weighted: The weighted approach marginally underperformed the baseline on recall and AUC. In addition, the variance in each performance metric is quite high relative to the other approaches.

Oversampling: The oversampling approach improves model recall relative to the baseline, but drastically degrades precision.

Undersampling: The approach performs worse than baseline across all metrics.

Synthetic: The synthetic method uplifts model recall, albeit at the cost of precision. While the impact on precision remains substantial, the synthetic approach provides a more resilient alternative for enhancing model recall with less of a detriment to precision when compared to the oversampling approach. The robustness of the synthetic approach is further evidenced by the uplift in AUC-PR.

Conclusion
We've noted that the synthetic data approach can boost model recall relative to the baseline at the expense of precision. Oversampling accomplishes a similar result, but model precision suffers drastically in comparison.
In our specific context of credit card fraud detection, false positives are not as costly as false negatives. Therefore, we can afford to compromise on model precision if it yields a significant boost in recall. Enriching our training data with synthetic instances appears to be an effective strategy for enhancing recall while mitigating the detrimental effect on precision. This could notably affect profitability, especially when scaling the model to handle millions of transactions. Ultimately, attributing an exact cost to false positives and false negatives would give a clearer picture of the most commercially viable approach, a topic beyond the scope of this article.
It would be fascinating to examine the performance across varying sample sizes of synthetic data, perhaps in conjunction with weighted losses. Similarly, experimenting with diverse oversampling ratios could potentially yield comparable effects to what we have observed with the synthetic approach.
The notebook for this project is available in my GitHub repo.
Follow me on LinkedIn.
Subscribe to Medium to get more insights from me:
Should you be interested in integrating AI or Data Science into your business operations, we invite you to schedule a complimentary initial consultation with us: