Benchmarking Machine Learning Models with Cross-Validation and Matplotlib in Python

In this article, we will look at how to use Python to compare and evaluate the performance of machine learning models.
We will use cross-validation with Sklearn to test the models and Matplotlib to display the results.
The main motivation for doing this is to have a clear and accurate understanding of model performance and thus improve the model selection process.
Cross-validation is a robust method for testing models on data other than the training data. It evaluates the model on held-out folds, i.e. data that was not used to fit the model itself, which gives us a more accurate estimate of how the model will perform on real data.
For a detailed explanation of cross-validation, check out this article.
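To make the idea concrete, here is a minimal, self-contained sketch of cross-validation with scikit-learn's cross_val_score; the dataset and model are placeholders chosen only to illustrate the API.

from sklearn import datasets, linear_model, model_selection

# Toy data and a simple classifier, used only to illustrate the API
X, y = datasets.make_classification(n_samples=100, n_features=10, random_state=0)
model = linear_model.LogisticRegression()

# 5-fold cross-validation: the model is fit on 4 folds and scored on the held-out fold
scores = model_selection.cross_val_score(model, X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # average score across folds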
We will use an object-oriented approach so that we can reuse it for other machine learning projects easily, making this method highly replicable.
The Benchmark class
To start, we will create a class called Benchmark, which will be responsible for testing the models. The class will accept a dictionary of models, where the key is the model name and the value is the model object itself. The class will also generate toy test data with scikit-learn's make_classification function whenever no data is provided.
import numpy as np
from sklearn import model_selection
from sklearn import datasets
import matplotlib.pyplot as plt


class Benchmark:
    """
    Compare and evaluate the performance of machine learning
    models using cross-validation.

    Parameters
    ----------
    models : dict
        Dictionary of models, where the key is the name of the
        model and the value is the model object.
    """

    def __init__(self, models):
        self.models = models

    def test_models(self, X=None, y=None, cv=5):
        """
        Test the models using the provided data and cross-validation.

        Parameters
        ----------
        X : array-like or DataFrame, shape (n_samples, n_features)
            Features for the test data.
        y : array-like or Series, shape (n_samples,)
            Target for the test data.
        cv : int, cross-validation generator or an iterable, optional
            Number of folds for the cross-validation.

        Returns
        -------
        message : str
            Name of the best model and its mean cross-validation score.
        """
        # Fall back to a generated toy dataset if no data is provided
        if X is None or y is None:
            X, y = datasets.make_classification(
                n_samples=100,
                n_features=10,
                n_classes=2,
                n_clusters_per_class=1,
                random_state=0
            )
        self.results = {}
        for name, model in self.models.items():
            scores = model_selection.cross_val_score(model, X, y, cv=cv)
            self.results[name] = scores.mean()
        self.best_model = max(self.results, key=self.results.get)
        return f"The best model is: {self.best_model} with a score of {self.results[self.best_model]:.3f}"
The main function of the class is test_models, which accepts test data and uses cross-validation to test the models. The function stores the results in an instance attribute and returns the name of the model with the highest mean score across the cross-validation folds.
To display the results, we will add a function called plot_cv_results to the class. This function uses Matplotlib to create a bar chart showing the average cross-validation score for each model.
def plot_cv_results(self):
    """
    Create a bar chart to visualize the cross-validation results.

    Returns
    -------
    None
    """
    plt.figure(figsize=(15, 5))
    x = np.arange(len(self.results))
    plt.bar(x, list(self.results.values()), align='center', color='g')
    plt.xticks(x, list(self.results.keys()))
    plt.ylim([0, 1])
    plt.ylabel('Cross-Validation Score')
    plt.xlabel('Models')
    plt.title('Model Comparison')
    for index, value in enumerate(self.results.values()):
        plt.text(index, value, str(round(value, 2)))
    plt.show()
Finally, to use the class, we instantiate the Benchmark object by passing in the dictionary of models, call the test_models function with the test data, and then use the plot_cv_results function to display the results.
from sklearn import linear_model, ensemble

models = {
    'logistic': linear_model.LogisticRegression(),
    'randomforest': ensemble.RandomForestClassifier(),
    'extratrees': ensemble.ExtraTreesClassifier(),
    'gbm': ensemble.GradientBoostingClassifier()
}

benchmark = Benchmark(models)
print(benchmark.test_models())
benchmark.plot_cv_results()
The result is a bar chart titled "Model Comparison", with one bar per model annotated with its mean cross-validation score.
This way, we can easily compare and evaluate the performance of models and then choose the model that performs best for our specific problem.
In this example we used the make_classification function to generate the toy data, but of course you can use any dataset you like.
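For instance, here is a minimal sketch of running the same benchmark on scikit-learn's built-in breast cancer dataset instead of the generated toy data.

from sklearn import datasets

# Load a real dataset and pass it explicitly to test_models
# (LogisticRegression may warn about convergence on unscaled data)
X, y = datasets.load_breast_cancer(return_X_y=True)
print(benchmark.test_models(X, y, cv=5))
benchmark.plot_cv_results()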
Additionally, the Benchmark class can be extended to include other features, such as the ability to save results to a file or to test models across multiple datasets.
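As an example of such an extension, here is a rough sketch of a hypothetical subclass with a save_results method (not part of the class above) that dumps the results dictionary to a JSON file.

import json

class BenchmarkWithExport(Benchmark):
    """Hypothetical extension that can persist the cross-validation results."""

    def save_results(self, path='cv_results.json'):
        # Assumes test_models has already been called, so self.results exists
        with open(path, 'w') as f:
            json.dump({name: float(score) for name, score in self.results.items()}, f, indent=2)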
What are the next steps?
Following the usual machine learning pipeline, the next step would be to tune the hyperparameters of the best model (in this case ExtraTreesClassifier), assuming our features are considered definitive.
If they are not, an intermediate step would be feature selection / engineering, repeating the benchmarking step every time the features change.
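As a rough sketch of that tuning step, a small grid search over a few ExtraTreesClassifier hyperparameters could look like the following; the grid values are illustrative, not a recommendation.

from sklearn import datasets, ensemble, model_selection

# The same toy data used by the Benchmark class; swap in your real features/target
X, y = datasets.make_classification(
    n_samples=100, n_features=10, n_classes=2,
    n_clusters_per_class=1, random_state=0
)

# Illustrative hyperparameter grid
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5]
}

grid = model_selection.GridSearchCV(
    ensemble.ExtraTreesClassifier(random_state=0),
    param_grid,
    cv=5
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)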
Conclusion
The Benchmark class we've created is just one example of how you can implement this technique in a project, but it can easily be adapted and customized to meet your project's specific needs.
The main benefit of using this approach is that it automates the process of comparing and evaluating models, which can save time and reduce human errors.
Recommended Reads
For the interested, here is a list of books that I recommend for each ML-related topic. These are ESSENTIAL books in my opinion and have greatly impacted my professional career.
- Intro to ML: Confident Data Skills: Master the Fundamentals of Working with Data and Supercharge Your Career by Kirill Eremenko
- Sklearn / TensorFlow: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurelien Géron
- NLP: Text as Data: A New Framework for Machine Learning and the Social Sciences by Justin Grimmer
- Sklearn / PyTorch: Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python by Sebastian Raschka
- Data Viz: Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Knaflic
Useful Links (written by me)
- Learn how to perform a top-tier Exploratory Data Analysis in Python: Exploratory Data Analysis in Python – A Step-by-Step Process
- Learn the basics of TensorFlow: Get started with TensorFlow 2.0 – Introduction to deep learning
- Perform text clustering with TF-IDF in Python: Text Clustering with TF-IDF in Python
If you want to support my content creation activity, feel free to follow my referral link below and join Medium's membership program. I will receive a portion of your investment and you'll be able to access Medium's plethora of articles on Data Science and more in a seamless way.
Code template
Here is the entire codebase:
import numpy as np
from sklearn import model_selection
from sklearn import datasets
import matplotlib.pyplot as plt


class Benchmark:
    def __init__(self, models):
        self.models = models

    def test_models(self, X=None, y=None, cv=5):
        # Fall back to a generated toy dataset if no data is provided
        if X is None or y is None:
            X, y = datasets.make_classification(
                n_samples=100,
                n_features=10,
                n_classes=2,
                n_clusters_per_class=1,
                random_state=0
            )
        self.results = {}
        for name, model in self.models.items():
            scores = model_selection.cross_val_score(model, X, y, cv=cv)
            self.results[name] = scores.mean()
        self.best_model = max(self.results, key=self.results.get)
        return f"The best model is: {self.best_model} with a score of {self.results[self.best_model]:.3f}"

    def plot_cv_results(self):
        plt.figure(figsize=(15, 5))
        x = np.arange(len(self.results))
        plt.bar(x, list(self.results.values()), align='center', color='g')
        plt.xticks(x, list(self.results.keys()))
        plt.ylim([0, 1])
        plt.ylabel('Cross-Validation Score')
        plt.xlabel('Models')
        plt.title('Model Comparison')
        for index, value in enumerate(self.results.values()):
            plt.text(index, value, str(round(value, 2)))
        plt.show()


from sklearn import linear_model, ensemble

models = {
    'logistic': linear_model.LogisticRegression(),
    'randomforest': ensemble.RandomForestClassifier(),
    'extratrees': ensemble.ExtraTreesClassifier(),
    'gbm': ensemble.GradientBoostingClassifier()
}

benchmark = Benchmark(models)
print(benchmark.test_models())
benchmark.plot_cv_results()