You Think 80% Means 80%? Why Prediction Probabilities Need a Second Look

How reliable are the probabilities predicted by a machine learning model? What does a predicted probability of 80% mean? Is it the same as an 80% chance of the event occurring? In this beginner-friendly post, you'll learn the basics of prediction probabilities and calibration, and how to interpret these numbers in a practical context. With a demo, I will show how you can evaluate and improve these probabilities for better decision-making.
What do prediction probabilities represent?
Instead of calling model.predict(data), which gives you a 0 or 1 prediction for a binary classification problem, you might have used model.predict_proba(data). This will give you probabilities instead of zeroes and ones. In many data science cases this is useful, because it gives you more insight. But what do these probabilities actually mean?
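To see the difference, here is a minimal, self-contained sketch on a toy dataset (purely illustrative; the classifier clf and the data are made up for this example and have nothing to do with the adult dataset used later):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# toy data and model, just to illustrate the two prediction methods
X, y = make_classification(n_samples=1000, random_state=0)
clf = LogisticRegression().fit(X, y)
# predict returns hard class labels
print(clf.predict(X[:5]))                # e.g. [0 1 1 0 0]
# predict_proba returns one probability per class; column 1 is the positive class
print(clf.predict_proba(X[:5])[:, 1])    # e.g. [0.03 0.97 0.88 0.12 0.45]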
A predicted probability of 0.8 means that the model is 80% confident that an instance belongs to the positive class. Let's repeat that: the model is 80% confident that an instance belongs to the positive class. So it doesn't mean there is an 80% real-world likelihood of the event occurring. These are two different things. It's important to make this distinction, especially in use cases where probabilities drive decisions, for example in fraud detection, medical diagnoses, or risk assessment.
This touches on my motivation for writing this post: I noticed that many people easily make this mistake. As a data scientist, it can be important to explain this difference if your model is 'guilty', because business stakeholders might assume that an 80% prediction probability means an 80% real-world likelihood of the event occurring.
The problem with interpreting model confidence scores as probabilities
Why can it be a problem to interpret scores from predict_proba as real-world likelihoods?
Some models are overconfident: they produce high confidence scores even when they are often wrong. Other models are underconfident: they produce low confidence scores even when they are often right. If model confidence scores are not calibrated, this can be misleading.
Another thing to keep in mind is that confidence scores are based on patterns learned from the training data. If the data is imbalanced or biased, the confidence scores might not reflect true probabilities in real-world scenarios.
Let's take a look at an example. It can happen that when your XGBoost model predicts a probability of 80%, the event in reality occurs only 65% of the time. Of course, we would like to see that if the model predicts 80% for 100 cases, the event occurs in approximately 80 of them. Otherwise we can't trust the probabilities.
How can we determine if a model is well-calibrated, meaning that the model confidence scores match the true likelihood of the event? Let's take a look at calibration and ways to improve it.
Calibrating model confidence scores
First, we want to visualize the alignment between model confidence scores and true outcomes on the test set. It's super easy:
- Group the prediction probabilities into bins; below I used 20 bins.
- Compute the fraction of positive cases for each bin. This corresponds to the true probability of the event occurring in that bin.
- Plot these true probabilities against the predicted probabilities.
And of course, if the model is perfectly calibrated, the points will lie on the diagonal line. For example: out of all cases in the 10% bin, the true probability (fraction of positives) is around 10%. Below you can see examples of a quite well-calibrated XGBoost model versus a not-so-well-calibrated Naive Bayes model. Both models are trained on the adult dataset.

Another way to check how well the model is calibrated is by using the Brier Score. This is also easy! It measures the mean squared difference between the predicted probabilities and the actual outcomes (so the lower the better):
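Written out, with N predictions, predicted probability p_i, and actual outcome o_i (1 if the event happened, 0 if not), it looks like this:
Brier Score = (1/N) * Σ (p_i − o_i)²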

If we calculate the Brier Score for the two models above, we get the following results:
Brier scores for adult dataset:
XGBoost: 0.10849946433956742
Naive Bayes: 0.1920520011951727
What we can conclude from the calibration plots is that the calibration of the XGBoost model is quite good. The one for Naive Bayes is far from perfect: the curve deviates from the diagonal line, and the Brier Score is high (almost twice as high as the Brier Score for the XGBoost model). Let's continue with the Naive Bayes model to show how we can improve the calibration! There are different ways of improving it; in this post we will take a look at Platt Scaling and Isotonic Regression.
The calibration curve and Brier Score are implemented in scikit-learn; you can import and compute them with the following code:
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss
from sklearn.naive_bayes import GaussianNB
# fit model on training data
model = GaussianNB()
model.fit(X_train, y_train)
# calculate predicted probabilities with predict_proba on the test set
probs = model.predict_proba(X_test)[:, 1]
# Brier Score: mean squared difference between predicted probabilities and true outcomes
brier_score = brier_score_loss(y_test, probs)
# fraction of positives (prob_true) and mean predicted probability (prob_pred) per bin
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=20, strategy='uniform')
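To turn prob_true and prob_pred into a calibration plot, a minimal sketch with matplotlib (my choice here; any plotting library will do) looks like this:
import matplotlib.pyplot as plt
# diagonal reference line = perfect calibration
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
# calibration curve: fraction of positives per bin vs. mean predicted probability
plt.plot(prob_pred, prob_true, marker='o', label='Naive Bayes')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()
plt.show()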
Platt Scaling
Platt Scaling is a simple and effective method for calibrating predicted probabilities. It works by fitting a logistic regression (sigmoid) model to the uncalibrated model's probabilities. Specifically, it minimizes the log-loss on a validation set, ensuring that the calibrated probabilities better reflect the true likelihood of the events.
To apply Platt Scaling, you split your data into training and validation sets. The first step is to train your model on the training set and generate uncalibrated probabilities for the validation set. Then you use these probabilities as the input feature to fit a logistic regression model that adjusts the predictions. This approach is particularly effective for models that produce continuous scores, such as SVMs or Naive Bayes. One note: Platt Scaling assumes a sigmoid-shaped (and therefore monotonic) relationship between the predicted probabilities and the true outcomes, which might not always hold.
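To make this mechanism concrete, here is a minimal sketch of Platt Scaling done by hand. It assumes a held-out validation split X_val, y_val (names of my choosing) and the already fitted base model from above; note that scikit-learn's LogisticRegression applies some regularization by default, so this is an approximation of the classic Platt procedure:
from sklearn.linear_model import LogisticRegression
# uncalibrated probabilities of the base model on the held-out validation set
val_probs = model.predict_proba(X_val)[:, 1]
# fit a logistic regression on the 1-D probability scores (the Platt step)
platt = LogisticRegression()
platt.fit(val_probs.reshape(-1, 1), y_val)
# calibrated probabilities for the test set
test_probs = model.predict_proba(X_test)[:, 1]
calibrated_probs = platt.predict_proba(test_probs.reshape(-1, 1))[:, 1]
In practice you rarely do this by hand: CalibratedClassifierCV with method='sigmoid' handles it for you.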
Here you can see the code for applying Platt Scaling, and the new calibration curve if we apply Platt Scaling to our Naive Bayes classifier on the adult dataset:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
# first, make sure you have fitted your model on the train set
model = GaussianNB()
model.fit(X_train, y_train)
# apply Platt Scaling (method='sigmoid')
# cv='prefit' makes sure that the method uses the already trained model;
# the calibrator itself should be fitted on data the model was not trained on,
# here the held-out validation set X_val, y_val
platt_model = CalibratedClassifierCV(model, method='sigmoid', cv='prefit')
platt_model.fit(X_val, y_val)
platt_probs = platt_model.predict_proba(X_test)[:, 1]

Isotonic Regression
Another common calibration technique is Isotonic Regression. This is a non-parametric method: unlike Platt Scaling, it does not assume a specific shape for the relationship (only that it is monotonically increasing), making it more flexible but also more prone to overfitting when you are working with a smaller dataset. The method fits a piecewise-constant (step) function that adjusts the predicted probabilities so they align better with the actual outcomes. The adjustment keeps the probabilities in order, meaning higher predictions still represent a higher likelihood of the event happening than lower predictions.
To implement Isotonic Regression, you again split your data and train the base model on the training set. The predicted probabilities on the validation set are used as inputs to fit an isotonic regression model, which adjusts the probabilities. It tends to produce better calibration than Platt Scaling in cases where the true probability distribution is irregular, like in our example. But watch out with small datasets, because Isotonic Regression can introduce artifacts like sharp jumps or dips in the calibration curve.
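As with Platt Scaling, here is a minimal hand-rolled sketch using scikit-learn's IsotonicRegression directly, again assuming a held-out validation split X_val, y_val and the already fitted base model:
from sklearn.isotonic import IsotonicRegression
# uncalibrated probabilities of the base model on the held-out validation set
val_probs = model.predict_proba(X_val)[:, 1]
# fit a monotonically increasing step function mapping scores to outcomes
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(val_probs, y_val)
# calibrated probabilities for the test set
calibrated_probs = iso.predict(model.predict_proba(X_test)[:, 1])
Again, CalibratedClassifierCV with method='isotonic' does the same job for you.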
Below again the code, using CalibratedClassifierCV as before, and the resulting calibration curve. You can clearly spot the jump and dip at a mean predicted probability of 0.6! Besides that, the curve looks nice.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
# first, make sure you have fitted your model on the train set
model = GaussianNB()
model.fit(X_train, y_train)
# apply Isotonic Regression, with method='isotonic'
# again fit the calibrator on the held-out validation set, not on the training data
iso_model = CalibratedClassifierCV(model, method='isotonic', cv='prefit')
iso_model.fit(X_val, y_val)
iso_probs = iso_model.predict_proba(X_test)[:, 1]

Comparing Calibration Methods
If we combine all the plots of the Naive Bayes model on the adult dataset (uncalibrated, Platt Scaling, and Isotonic Regression) to compare them, this is the result:

Looking at this plot, the Isotonic Regression calibration seems to fit best in this example; it only has the strange jump and dip at a mean predicted probability of 0.6 mentioned earlier. We can perform an extra check by calculating the Brier Scores:
Brier scores for adult dataset and Naive Bayes model:
Uncalibrated: 0.1920520011951727
Platt Scaling: 0.15621506274566171
Isotonic Regression: 0.12849532236356562
Indeed! Isotonic Regression has the best score.
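For reference, these scores can be computed with the same brier_score_loss call used earlier, applied to the three sets of test probabilities (probs, platt_probs, and iso_probs from the snippets above):
from sklearn.metrics import brier_score_loss
print('Uncalibrated:       ', brier_score_loss(y_test, probs))
print('Platt Scaling:      ', brier_score_loss(y_test, platt_probs))
print('Isotonic Regression:', brier_score_loss(y_test, iso_probs))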
You may have noticed that the uncalibrated XGBoost model had an even better Brier Score and calibration plot, and you are right. We could save ourselves the hassle of calibrating the Naive Bayes model and go for XGBoost on this dataset! Of course, if you test this in real life on your own data, it's not guaranteed that this will be the case.