Classification Metrics: The Complete Guide For Aspiring Data Scientists
Supervised Machine Learning can be divided into two groups of problems: classification and regression. This article aims to be the definitive guide on classification metrics: so if you're an aspiring Data Scientist or if you're a junior one, you definitely need to read this.
First of all, you may also like to read my guide on the 5 metrics you need to know to master a regression problem:
Mastering the Art of Regression Analysis: 5 Key Metrics Every Data Scientist Should Know
Secondly, let me tell you what you'll find here through a table of contents:
Table of Contents:
What is a classification problem?
Dealing with class imbalance
What a classification algorithm actually does
Accuracy
Precision and recall
F1-score
The confusion matrix
Sensitivity and specificity
Log loss (cross-entropy)
Categorical crossentropy
AUC/ROC curve
Precision-recall curve
BONUS: KDE and learning curves
As usual, you'll find Python examples to put the theory into practice.
What is a classification problem?
In a classification problem data are labeled into classes: in other words, our label values represent the class to which the data points belong.
There are two kinds of classification problems:
- Binary classification problems: in this case, the target values are labeled with a 0 or a 1.
- Multi-class problems: in this case, the label gets multiple values (0, 1, 2, 3, etc.), depending on the number of classes.
Let's visualize them. Firstly, let's create a binary classification dataset as follows:
import numpy as np
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(42)
# Generate data
num_samples = 1000
X = np.random.rand(num_samples, 2) * 10 - 5
y = np.zeros(num_samples)
y[np.sum(X ** 2, axis=1) < 5] = 1
# Plot data
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Binary Classification Dataset')
plt.show()

So, this is an example of a binary classification dataset: some data points belong to the blue class, others to the red class. Now, it doesn't matter what these classes represent: they could be apples or pears, cars or trains. What matters here is that we've visualized a binary classification problem.
Now, let's visualize a multi-class problem:
import numpy as np
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(42)
# Generate data
num_samples = 1000
X = np.random.rand(num_samples, 2) * 10 - 5
y = np.zeros(num_samples, dtype=int)
y[np.sum(X ** 2, axis=1) < 2.5] = 1
y[np.logical_and(X[:, 0] > 2, np.abs(X[:, 1]) < 1)] = 2
y[np.logical_and(X[:, 0] < -2, np.abs(X[:, 1]) < 1)] = 3
# Plot data
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 1')
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 2')
plt.scatter(X[y==2, 0], X[y==2, 1], c='green', label='Class 3')
plt.scatter(X[y==3, 0], X[y==3, 1], c='purple', label='Class 4')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Multiclass Classification Dataset')
plt.legend()
plt.show()

So, here we've created a classification problem with data points belonging to 4 classes.
One issue with multi-class classification problems is understanding if all the classes matter. Let's see what we mean in the next paragraph.
NOTE: in the case of binary classification, the classes can be labeled 0 and 1,
but they can also be labeled 1 and 2. So, there is no convention that
tells us we need to start from 0. The same holds for the multi-class case:
classes can be named 0, 1, 2, 3 as well as 1, 2, 3, 4.
Dealing with class imbalance
Consider the following dataset:
import numpy as np
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(42)
# Class 1: blue
mean1 = [0, 0]
cov1 = [[1, 0], [0, 1]]
num_points1 = 7000
X1 = np.random.multivariate_normal(mean1, cov1, num_points1)
# Class 2: green
mean2 = [3, 3]
cov2 = [[0.5, 0], [0, 0.5]]
num_points2 = 2700
X2 = np.random.multivariate_normal(mean2, cov2, num_points2)
# Class 3: red
mean3 = [-3, 3]
cov3 = [[0.5, 0], [0, 0.5]]
num_points3 = 300
X3 = np.random.multivariate_normal(mean3, cov3, num_points3)
# Plot the data
plt.scatter(X1[:, 0], X1[:, 1], color='blue', s=1, label='Class 1')
plt.scatter(X2[:, 0], X2[:, 1], color='green', s=1, label='Class 2')
plt.scatter(X3[:, 0], X3[:, 1], color='red', s=1, label='Class 3')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Imbalanced Multiclass Classification Dataset')
plt.legend()
plt.show()

As we can see, we have a lot of blue points and also a high number of green points. The red points, instead, are very few compared to the others.
The question is: should we take the red points into account? In other words: can we perform our ML analysis by deleting the red points because they are too few?
The answer is…it depends!
Generally, we can ignore the values belonging to one (or more) class(es) with far fewer observations than the others. But in specific cases we mustn't! And this is where domain knowledge comes into play.
For example, if we're studying fraud detection for a bank, we expect fraudulent transactions to be rare compared to standard transactions. This gives us an imbalanced dataset, meaning we can't delete the values belonging to the class with fewer observations!
The same goes if we're studying something in the medical field. In the case of rare diseases, we expect them to be…rare! So, an imbalanced dataset is exactly what we expect.
Anyway, we created the datasets above on purpose, for educational reasons. Generally speaking, in real cases it's very hard to visualize the data points because we have more than two features. So, a practical way to evaluate class imbalance is to display a histogram of the labels.
Before going on…if you don't know the difference between a histogram and a bar plot, you can read the following article I wrote:
So, here's what we can do. Let's create a dataset with three labels like the following:
import pandas as pd
import numpy as np
# Create a list of labels
labels = ['1', '2', '3']
# Create a list of features
features = ['feature_1', 'feature_2', 'feature_3']
# Set the number of samples
num_samples = 1000
# Create an empty Pandas DataFrame to store the data
data = pd.DataFrame()
# Add the features to the DataFrame
for feature in features:
    data[feature] = np.random.rand(num_samples)
# Add the labels to the DataFrame
data['label'] = np.random.choice(labels, num_samples)
Even though this data frame is created artificially, it reflects real cases because it's tabular (meaning we can manipulate it with pandas). So, if we show the head we get:

So, to understand if our dataset may be imbalanced or not we plot a histogram like so:
import seaborn as sns
import matplotlib.pyplot as plt
# Plot histogram
sns.histplot(data=data, x='label')
# Write title and axis labels
plt.title('CLASSES FREQUENCIES', fontsize=14) #plot TITLE
plt.xlabel('Our labels (our classes)', fontsize=12) #x-axis label
plt.ylabel('Frequencies of the three classes', fontsize=12) #y-axis label

Well, in this case, the three classes have roughly the same frequency. So the dataset is well-balanced and we must consider all the labels in our analysis.
Instead, this is how class imbalance is represented via a histogram:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Create a list of labels with class imbalance
labels = ['1'] * 500 + ['2'] * 450 + ['3'] * 50
# Create a list of features
features = ['feature_1', 'feature_2', 'feature_3']
# Shuffle the labels
np.random.shuffle(labels)
# Create an empty Pandas DataFrame to store the data
data = pd.DataFrame()
# Add the features to the DataFrame
for feature in features:
    data[feature] = np.random.rand(len(labels))
# Add the labels to the DataFrame
data['label'] = labels
# Plot histogram
sns.histplot(data=data, x='label')
# Write title and axis labels
plt.title('CLASSES FREQUENCIES', fontsize=14) #plot TITLE
plt.xlabel('Our labels (our classes)', fontsize=12) #x-axis label
plt.ylabel('Frequencies of the three classes', fontsize=12) #y-axis label

So, in cases like this, we need to understand whether class 3 has to be taken into account (we're studying "rare situations") or not (we're studying situations with no rare events), so that we can decide whether to drop all the values associated with it.
Now, before diving into the metrics we need to know to solve a classification problem, we need to understand what a classification algorithm actually does.
What a classification algorithm actually does
As we know, we use Machine Learning to make predictions. This means that we train an ML model on the available data, expecting the predictions to be as near as possible to the actual data.
If you don't know what "training an ML model" actually means, you can read my article here:
So, let's consider a binary classification problem. Our ML model gets the features as input and predicts if the data points belong to class 1 or to class 2. If the predictions "are perfect", it means that our model tells us precisely which of the available data belongs to class 1 and which to class 2, with 0 errors. So, all the actual points belonging to class 1 are predicted to belong to class 1 by our ML model.
Of course, as you imagine, a 0% error is not possible, and this is why we need some metrics to evaluate our ML models.
So before diving into the metrics, we need to use some nomenclature:
- We define a True Positive (TP) as a data point belonging to a class that is predicted to belong to that class. For example, if the model predicts that an email is spam, and it is indeed spam, then that is a true positive.
- We define a True Negative (TN) as a data point not belonging to a class that is predicted to not belong to that class. For example, if the model predicts that an email is not spam, and it is indeed not spam, then that is a true negative.
- We define a False Positive (FP) as a data point not belonging to a class that is predicted to belong to that class. For example, if the model predicts that an email is spam, but it is actually not spam, then that is a false positive.
- We define a False Negative (FN) as a data point belonging to a class that is predicted not to belong to that class. For example, if the model predicts that an email is not spam, but it is actually spam, then that is a false negative.
Generally speaking, as you may imagine, we want to minimize false positives and false negatives while maximizing true positives and true negatives, to make the model as accurate as possible. This means that our ML model makes accurate predictions.
But what does "accurate" mean? We need to dive into our first classification metrics to understand it.
Accuracy
The first metric we take into account is accuracy. Let's see the formula:
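Accuracy = (TP + TN) / (TP + TN + FP + FN)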
So, accuracy is a measure of how often our ML model is correct in its predictions.
For example, let's say we have a dataset of emails that are labeled as either spam or not spam. We can use ML to predict whether new emails are spam or not. If the model correctly predicts that 80 out of 100 emails are spam, and correctly predicts that 90 out of 100 emails are not spam, then its accuracy would be:
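Accuracy = (80 + 90) / (100 + 100) = 170 / 200 = 0.85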
This means that our model is able to correctly predict the class of an email 85% of the time. A high accuracy score (near 1) indicates that the model is performing well, while a low accuracy score (near 0) indicates that the model needs to be improved. However, accuracy alone may not always be the best metric to evaluate a model's performance, especially in imbalanced datasets.
This is understandable because the prevalent class has "more data" labeled to it, so a model can reach a high accuracy simply by making its predictions according to the prevalent class. In other words, our model may be biased toward the prevalent class.
Let's make an example in Python creating a dataset for this purpose:
import numpy as np
import pandas as pd
# Random seed for reproducibility
np.random.seed(42)
# Create samples
n_samples = 1000
fraud_percentage = 0.05 # Fraudulent percentage
# Create classes
X = np.random.rand(n_samples, 10)
y = np.random.binomial(n=1, p=fraud_percentage, size=n_samples)
# Create data frame
df = pd.DataFrame(X)
df['fraudulent'] = y
We have created a simple data frame with 1000 samples that could represent, for example, the data of some credit card transactions. We have then created a label for the fraudulent transactions, which represent 5% of all the observations. So, this dataset is clearly imbalanced.
If our model appears accurate, it is because it's biased by the 95% of observations belonging to the class representing the non-fraudulent transactions. So let's split the dataset, make predictions with a Logistic Regression model, and print the accuracy:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit logistic regression model to train set
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
>>>
Accuracy: 0.95
So, our model is 95% accurate: hooray! Now…let's define the other metrics and see what they tell us about this dataset.
Precision and recall
Precision measures the ability of a classifier not to label as positive a sample that is negative. In other words, it measures the fraction of true positives among all positive predictions. Simplifying, precision tells us how accurate the positive predictions of our model are. That's the formula:
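Precision = TP / (TP + FP)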
Considering an email spam classification problem, precision measures how many of the emails that the model classified as spam are actually spam.
Let's use it in our imbalanced dataset:
from sklearn.metrics import precision_score
# Calculate and print precision
precision = precision_score(y_test, y_pred)
print('Precision:', precision)
>>>
Precision: 0.0
Ouch! 95% accuracy and 0% precision: what does it mean? It means that the model is predicting all samples as negative, i.e. non-fraudulent. Which is wrong, of course. In fact, a high precision score would indicate that the model correctly identifies a high proportion of fraudulent transactions among all the transactions it predicts as fraudulent.
Then, we have the recall metric that measures the fraction of true positives among all actual positives. In other words, it measures how many of the actual positives are correctly predicted. Simplifying, recall tells us how well our model is able to find all the positive instances in our data. Here's the formula:
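Recall = TP / (TP + FN)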
Considering an email spam classification problem, recall measures how many of the actual spam emails in the dataset are correctly identified as spam emails by our ML classifier.
Let's say that we have a dataset of 1000 emails, where 200 of them are spam and the rest are legitimate. We train a machine learning model to classify emails as spam or not spam, and it predicts that 100 of the emails are spam.
Precision would tell us how many of those 100 predicted spam emails are actually spam. For example, if 90 out of the 100 predicted spam emails are actually spam, then the precision would be 90%. This means that out of all emails that the model predicted as spam, 90% of them are actually spam.
Recall, on the other hand, tells us how many of the actual spam emails the model correctly identified as spam. For example, if out of the 200 actual spam emails, the model correctly identified 150 of them as spam, then the recall would be 75%. This means that out of all actual spam emails, the model correctly identified 75% of them as spam.
Now, let's use recall in our imbalanced dataset:
from sklearn.metrics import recall_score
# Calculate and print recall
recall = recall_score(y_test, y_pred)
print('Recall:', recall)
>>>
Recall: 0.0
Again: we have 95% accuracy and 0% recall. What does that mean? As before, it means that the model is not correctly identifying any fraudulent transactions and is instead predicting all transactions as non-fraudulent. In fact, a high recall score would indicate that the model correctly identifies a high proportion of fraudulent transactions among all actual fraudulent transactions.
So, in practice, we want to achieve a balance between precision and recall, depending on the problem we're studying. To do so, we often refer to two other tools that consider both of them: the F1-score and the confusion matrix. Let's see them.
F1-score
F1-score is an evaluation metric in Machine Learning that combines precision and recall into a single value in the range 0–1. If f1-score results in a 0 value, then our ML model has low performance. If f1-score results in a 1 value, then our ML model has high performance.
This metric balances precision and recall by calculating their harmonic mean. This is a type of average that is more sensitive to low values, and this is why this metric is particularly suitable for imbalanced datasets.
Let's see its formula:
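F1-score = 2 * (Precision * Recall) / (Precision + Recall)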
Now, we already know the result we'll get for our imbalanced dataset (the F1-score will be 0). But let's see how to use it in Python:
from sklearn.metrics import f1_score
# Calculate and print f1-score
f1 = f1_score(y_test, y_pred)
print('F1 score:', f1)
>>>
F1 score: 0.0
In the context of a spam classifier, let's say we have a dataset of 1000 emails, where 200 of them are spam and the rest are legitimate. We train a machine learning model to classify emails as spam or not spam, and it predicts that 100 of the emails are spam.
To calculate the F1-score of the spam classifier, we first need to calculate its precision and recall. Let's say that out of the 100 predicted spam emails, 80 are actually spam. So, the precision is 80%. Also, let's say that out of the 200 actual spam emails, the model correctly identified 150 of them as spam. So, the recall is 75%.
Now we can calculate the f1-score:
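F1-score = 2 * (0.80 * 0.75) / (0.80 + 0.75) = 1.2 / 1.55 ≈ 0.77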
Which is a pretty good result as we're near 1.
The confusion matrix
The confusion matrix is a table that summarizes the performance of a classification model by showing the number of true positives, false positives, true negatives, and false negatives.
In a binary classification problem, the confusion matrix has two rows and two columns and it's displayed like so:

Using the spam email classification example, let's say that our model predicted 100 emails as spam, out of which 80 were actually spam, and 900 emails as not spam, out of which 20 were actually spam.
The confusion matrix for this example would look like that:
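With the actual classes on the rows and the predicted classes on the columns, those numbers correspond to:
                     Predicted spam    Predicted not spam
Actually spam        TP = 80           FN = 20
Actually not spam    FP = 20           TN = 880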

Now, this is a very useful visualization tool for classification for two reasons:
- It helps us visualize the quantities needed to calculate precision and recall
- It immediately tells us what matters, without any calculations. What we want in a classification problem, in fact, is for TN and TP to be as high as possible and for FP and FN to be as low as possible (as close to 0 as possible). So, if the values on the main diagonal are high and the off-diagonal values are low, then our ML model has good performance.
This is the reason why I love the confusion matrix: we just need to look at the main diagonal (from top-left to bottom-right) and the off-diagonal values to evaluate the performance of an ML classifier.
Considering our imbalanced dataset, we obtained 0 for precision and recall and we said that it means that the model is not correctly identifying any fraudulent transactions, and is instead predicting all transactions as non-fraudulent.
This may be really difficult to visualize from the formulas of precision and recall alone: we'd have to keep them clear in our minds. Since that kind of mental visualization is not easy, let's apply the confusion matrix to our example and see what happens:
from sklearn.metrics import confusion_matrix
# Calculate and print confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix:\n', cm)
>>>
Confusion matrix:
[[285 0]
[ 15 0]]
See what happens?! We can clearly say that our model is not performing well because, while it captures 285 TNs it captures 0 TPs! That's the visual power of the confusion matrix!
There is also another way to display the confusion matrix, and I really love it because it improves the visualization experience. Here's the code:
from sklearn.metrics import ConfusionMatrixDisplay
# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot confusion matrix
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()

This kind of visualization is very useful in the case of multi-class classification problems. Let's see one example:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Generate random data with 3 classes
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
n_clusters_per_class=1, n_informative=5,
class_sep=0.5, random_state=42)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
from sklearn.linear_model import LogisticRegression
# Train a logistic regression model on the training data
clf = LogisticRegression(random_state=42).fit(X_train, y_train)
# Make predictions on the test data
y_pred = clf.predict(X_test)
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=['Class 0', 'Class 1', 'Class 2'])
disp.plot()

In these cases it's not easy to tell what the TPs, the TNs, and so on are, because we have three classes. Anyway, we can simply refer to the values on the main diagonal and to the off-diagonal ones. In this case, on the main diagonal we have 49, 52, and 44, which are much higher than the off-diagonal values, telling us that this model is performing well (also note we've calculated the confusion matrix on the test set!).
Sensitivity and specificity
There are a couple of metrics that, in my personal opinion, are more suitable in some particular cases: sensitivity and specificity. Let me describe them, and then we'll discuss their usability in those cases.
Sensitivity is the ability of a classifier to find all the positive samples:
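Sensitivity = TP / (TP + FN)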
Wait a second! But isn't it the recall?!?
Yes, it is. It's not a mistake. This is why I'm telling you that these metrics are more suitable for particular cases. But let me go on.
We define specificity as the ability of a classifier to find all the negative samples:
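Specificity = TN / (TN + FP)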
So, both of them describe how "reliable" a test is: sensitivity is the probability that an actual positive is detected as positive, while specificity is the probability that an actual negative is detected as negative.
In my experience, these metrics are more suitable for classifiers used in the medical field, biology, and so on.
For example, let's take into account a COVID test. Consider this approach (which can be considered Bayesian, but let's skip that): you take a COVID test and the result is positive. Question: what's the probability that the test detects the virus when it's actually there? And what's the probability that it returns a negative result when the virus is not there?
In other words: what are the sensitivity and the specificity of the tool you used to get the result?
Well, you may ask yourself: what kind of question are you asking, Federico?
Let me make an example I lived last summer.
Here in Italy, a positive COVID test had to be certified by someone (let's skip the reasons for that): typically a hospital or a pharmacy. So, when we had symptoms, what we generally did here was test for COVID at home (a 3-5€ COVID test), then go to a pharmacy to confirm (a 15€ COVID test).
So, last July I had symptoms after my wife and daughters tested positive. I tested at home and the result was positive. Then I immediately went to the pharmacy to confirm, and…the result was negative!
How is that possible? Easy: the tool I used at home for the COVID test was more sensitive than the one used by the pharmacist (or, the test used by the pharmacist was more specific than the one I used).
So, in my experience, these metrics are particularly suitable for measuring instruments of any kind (mechanical, electrical, etc.) and/or in some particular fields (like biology, medicine, etc.). Also, remember that these metrics use TP, TN, FP, and FN just as precision and recall do: this stresses again the fact that they are more suitable in the case of a binary classification problem.
Of course, I'm not telling you that sensitivity and specificity must be used only in the above-mentioned cases. They're just more suitable, in my experience.
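Before moving on, here's a minimal Python sketch of how we could compute both scores. It assumes the y_test and y_pred from the imbalanced fraud example above are still in scope, and it derives the two values from the confusion matrix (scikit-learn has no dedicated specificity function):
from sklearn.metrics import confusion_matrix
# For binary labels 0/1, sklearn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)  # same as recall
specificity = tn / (tn + fp)
print('Sensitivity:', sensitivity)
print('Specificity:', specificity)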
Log loss (cross-entropy)
Log loss – sometimes called cross-entropy – is an important metric in classification, and is based on probability. This score compares the predicted probability for each class to the actual class labels.
Let's see the formula:
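Log Loss = -(1/n) * Σ_i [ y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i) ]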
Where we have:
- n is the total number of observations, and i is a single observation.
- y is the true value.
- p is the predicted probability.
- ln is the natural logarithm.
To calculate the predicted probability p, we need to use an ML model that can actually calculate probabilities, like Logistic Regression, for example. In this case, we need to use the predict_proba() method like so:
from sklearn.linear_model import LogisticRegression
# Invoke logistic regression model
model = LogisticRegression()
# Fit the data on the train set
model.fit(X_train, y_train)
# Calculate probabilities
y_prob = model.predict_proba(X_new)
So, suppose we have a binary classification problem and suppose we calculate the probabilities via the Logistic Regression model, and suppose the following table represents our results:

The calculation we'd perform to obtain the Log Loss is as follows:
And this results in a value near 0, which should make us satisfied: it means our Logistic Regression model is predicting the labels for each class quite well. In fact, a Log Loss of 0 represents the best fit possible. In other words, a model with a Log Loss of 0 predicts each observation's probability as the true value.
But don't be scared: we don't need to calculate the value of Log Loss by hand. Luckily for us, sklearn comes to our aid. So, let's return to our imbalanced dataset. To calculate Log Loss in Python we type the following:
from sklearn.metrics import log_loss
# Calculate and print Log Loss
# (note: log_loss is usually fed the predicted probabilities from predict_proba();
# here we pass the hard 0/1 predictions, which heavily penalizes every mistake)
log_loss_score = log_loss(y_test, y_pred)
print("Log loss score:", log_loss_score)
>>>
Log loss score: 1.726938819745535
Again, we got a bad metric on the test set, confirming all of the above.
Finally, one last consideration: Log Loss is suitable for binary classification problems. How about multi-class problems?
Categorical cross-entropy
The categorical cross-entropy metric represents the generalization of the Log Loss to the multi-class case.
This metric is particularly suitable for imbalanced datasets because it takes into account the probability of the predicted class. This is important when we have an imbalanced dataset because the relative frequency of the classes can influence the ability of the model to correctly predict the "minority" classes.
Here we have:
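Categorical Cross-Entropy = -(1/n) * Σ_i Σ_c [ y_i,c * ln(p_i,c) ]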
Where the nomenclature is the same as for the Log Loss case, with the additional index c running over the classes (y_i,c is 1 only for the true class of observation i, and p_i,c is the predicted probability of class c).
Finally, in Python we use it the same way we do with Log Loss, by invoking log_loss from sklearn.metrics (which handles the multi-class case as well). So, this discussion was just to stress the fact that there is a slight difference between the binary classification case and the multi-class one.
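As a minimal sketch, assuming the 3-class clf, X_test, and y_test from the multi-class confusion-matrix example above are still in scope, it would look like this:
from sklearn.metrics import log_loss
# predict_proba() returns one probability per class for each sample
probs = clf.predict_proba(X_test)
# log_loss handles the multi-class case directly
cce = log_loss(y_test, probs)
print('Categorical cross-entropy:', cce)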
AUC/ROC curve
ROC stands for "Receiver Operating Characteristic" and is a graphical way to evaluate a classifier by plotting the true positive rate (TPR) against the false positive rate (FPR) at different thresholds.
AUC, instead, stands for "Area Under the Curve" and represents the area under the ROC curve. So this is an overall performance measure, ranging from 0 to 1 (where 1 means the classifier perfectly separates the two classes), and it's particularly suitable for comparing different classifiers.
Firstly, let's define TPR and FPR:
- TPR is the sensitivity (which can also be called recall, as we said).
- FPR is defined as 1 - specificity.
Note that AUC/ROC is suitable in the case of a binary classification problem. In the multi-class case, in fact, TPR and FPR would have to be redefined (typically with a one-vs-rest approach), which requires some extra work; so my advice here is to use it just in the case of a binary classification problem.
Now, let's see how to implement this in Python:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# Generate a random binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Fit a logistic regression model on the training data
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict probabilities for the testing data
probs = model.predict_proba(X_test)
# Compute the ROC curve and AUC score
fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
auc_score = roc_auc_score(y_test, probs[:, 1])
# Plot the ROC curve
plt.plot(fpr, tpr, label='AUC = {:.2f}'.format(auc_score))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()

The dashed line represents a purely random classifier (which is like randomly guessing one class rather than the other; in fact, since this is a binary classification problem, that diagonal corresponds to an AUC of 0.5, meaning a 50% chance of guessing right). So, the farther our curve is from it, the better our model. Ideally, our curve should stay as close as possible to the top-left corner, meaning a low False Positive Rate with a high True Positive Rate.
This is why this graph is good to compare models: better models have curves near the top-left corner of the graph. Let's see an example: we'll use the same dataset as before, but we'll fit the data to three different ML models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# Generate a random binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Fit three different classifiers on the training data
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(n_estimators=100)
clf3 = KNeighborsClassifier(n_neighbors=5)
clfs = [clf1, clf2, clf3]
# Predict probabilities for the testing data
plt.figure(figsize=(8,6))
for clf in clfs:
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)
    fpr, tpr, _ = roc_curve(y_test, probs[:,1])
    auc_score = roc_auc_score(y_test, probs[:,1])
    plt.plot(fpr, tpr, label='{} (AUC = {:.2f})'.format(clf.__class__.__name__,
                                                        auc_score))
# Plot the ROC/AUC curves for each classifier
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc="lower right")
plt.show()

So, in this case, the Random Forest classifier is the one that predicts our data best, because its curve lies closest to the top-left corner, above the curves of the other models.
To conclude this section, let me remind you that, at the beginning of this paragraph, we said that ROC plots TPR against the FPR at different thresholds, but we haven't specified anything else. So, let's do so in the next paragraph.
Precision-recall curve
Consider a binary classification problem. We fit the data to a classifier and it assigns any predicted value to class 1 or to class 0: what are the criteria used for the assignation?
Stop reading for a bit and try to think about that.
Yes, you guessed it right: in classification problems, a classifier assigns a score between 0 and 1 to each sample, indicating the probability that the sample belongs to the positive class.
So, our ML models use a threshold value to convert these probability scores into class predictions: any sample with a probability score greater than the threshold is predicted as positive, for example.
Of course, this is true even in the case of a multiclass classification problem: we've used the case of a binary classification just to simplify our reasoning.
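To make this concrete, here's a minimal sketch. It assumes we have a fitted binary classifier clf that exposes predict_proba() (like the Logistic Regression models used throughout this article), a test set X_test, and a hypothetical threshold of 0.4 instead of the default 0.5:
# Probability of the positive class for each test sample
probs = clf.predict_proba(X_test)[:, 1]
# Apply a custom decision threshold (0.4 is just an example value)
threshold = 0.4
y_pred_custom = (probs >= threshold).astype(int)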
So, ROC curves are useful because they show how the performance of an ML model varies at different threshold values.
Anyway, the fact that a classifier assigns the predicted value to a class based on a threshold tells us that precision and recall are a trade-off (just like bias and variance).
Also, we can even plot the precision-recall curve. Let's see how to do so, using the same dataset we used for the AUC/ROC curve:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Generate a random binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Fit a logistic regression model on the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Predict probabilities for the testing data. Compute precision-recall curve
probs = clf.predict_proba(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, probs[:,1])
# Plot the precision-recall curve
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()

So, above we can see that precision stays at a value of 1 until roughly 0.5 recall, then it drops sharply. So, we'd like to choose a precision-recall trade-off before this point: let's say at 0.4 recall.
Another great way to visualize this tradeoff is to plot precision vs recall as the threshold varies. Using the same dataset, this is what happens:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Generate a random binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Fit a logistic regression model on the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Predict probabilities for the testing data. Compute precision-recall curve
probs = clf.predict_proba(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, probs[:,1])
# Plot precision and recall as thresholds change
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.xlabel('Threshold')
plt.ylabel('Precision & Recall')
plt.legend()
plt.title('Precision and Recall as Thresholds Change')
plt.show()

So, the above plot confirms that the threshold that balances the precision-recall trade-off is around 0.4, in this case.
So, when someone tells you that they've found an ML model with 95% precision, you should ask: "At what recall?"
Finally, since they rely on much the same quantities, you may be wondering when we should use the AUC/ROC curve and when the precision-recall one. Quoting from reference 1 (page 92):
As a rule of thumb, you should prefer the precision-recall curve whenever the positive class is rare, or when you care more about the false positives than the false negatives, and the ROC curve otherwise
Bonus: KDE and learning curves
Among all the methods and metrics we've seen above, which are specific to classification, there are two that are transversal, meaning they can be used to evaluate both classification and regression problems.
These are the KDE plot and learning curves. I've written about them in previous articles, so I'll link them below:
You'll find what a KDE is and how to use it at point 3 of the paragraph "Graphical methods to validate your ML model" in the following article:
Mastering Linear Regression: The Definitive Guide For Aspiring Data Scientists
You can read about what learning curves are and how to use them here:
Conclusions
So far, we've seen a lot of metrics and methodologies to validate a classification algorithm. If you're wondering which one to use, I always say that, while gaining experience with each one (especially by comparing them) is good practice, it's difficult to answer the question, for a lot of reasons. Often, it's just a question of taste.
Also, as a rule of thumb, using just one metric to evaluate an ML model is not sufficient.
If you read other articles from me, you know that I personally love to use at least one analytical method and one graphical one. In the case of classification problems, I generally use the confusion matrix and the KDE.
But, again: it's a matter of personal taste. My advice here is to practice with them and decide which ones you like, remembering that you'll need more than one to make accurate decisions on your ML models.
FREE PYTHON EBOOK:
Started learning Python Data Science but struggling with it? Subscribe to my newsletter and get my free ebook: this will give you the right learning path to follow to learn Python for Data Science with hands-on experience.
Enjoyed the story? Become a Medium member for $5/month through my referral link: I'll earn a small commission at no additional cost to you:
Bibliography and references:
- [1] Hands-On Machine Learning with Scikit-Learn & TensorFlow – Aurélien Géron
- [2] Machine Learning with PyTorch and Scikit-Learn – Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili