Boosting Algorithms in Machine Learning, Part I: AdaBoost

Introduction
In machine learning, boosting is an ensemble learning method that combines several weak learners into a single strong learner. The idea is to train the weak learners sequentially, each one trying to correct the mistakes of its predecessor, so that together they eventually form a strong ensemble.
In this article, we will learn about one of the most popular boosting techniques, AdaBoost, and see how elegantly each weak learner passes information about its mistakes on to the next one, eventually improving the quality of the predictions.
We will cover the following topics in this article:
- Understand what an ensemble is and its different types; this is where we will define boosting.
- Understand, with an example, what is meant by a weak learner, since boosting involves fitting multiple weak learners.
- Understand why there was a need for boosting algorithms.
- Get introduced to the AdaBoost algorithm.
- Implement the AdaBoost algorithm for binary classification in Python and look at the individual weak learners as well as the final predictions with some interesting visualizations.
- Finally, we will wrap up.
In the opening sentence, I mentioned that boosting is a kind of ensemble learning that combines several weak learners into a single strong learner. Before digging in, we should first be familiar with these terms. So, let's get straight into it!
What is an Ensemble?
An ensemble is a collection of multiple base models that can be aggregated together in such a way that the quality of the final predictions is much better than what an individual base model could have produced by itself.
In terms of the diversity of base models, ensembles can be of two types:
- Homogeneous ensemble, in which the base learners are the same. For example, a random forest is an ensemble of multiple decision trees.
- Heterogeneous ensemble, in which the base learners are different from each other. For example, we can train a support vector machine (SVM), a decision tree, and a neural network (NN) on the same classification dataset and then aggregate their predictions to get the final result.
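To make the distinction concrete, here is a minimal sketch, assuming scikit-learn and a small synthetic dataset created purely for illustration (the class names and parameters below are scikit-learn specifics, not part of the definitions above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# A small synthetic binary classification dataset, for illustration only
X, y = make_classification(n_samples=500, random_state=42)

# Homogeneous ensemble: a random forest is just a collection of decision trees
homogeneous = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Heterogeneous ensemble: an SVM, a decision tree, and a small neural network
# trained on the same data, aggregated by majority (hard) voting
heterogeneous = VotingClassifier(
    estimators=[
        ("svm", SVC()),
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("nn", MLPClassifier(max_iter=1000, random_state=42)),
    ],
    voting="hard",
).fit(X, y)

print(homogeneous.score(X, y), heterogeneous.score(X, y))
```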
In terms of how the base learners are trained and aggregated together, ensemble learning can be of three types:
- Bagging (short for bootstrap aggregation), in which homogeneous base learners are trained in parallel on random sub-samples of the training data, and their predictions are combined using a deterministic aggregation method, e.g., majority voting for classification and numerical averaging for regression.
- Boosting (originally called hypothesis boosting), in which homogeneous base learners are trained sequentially on the same training data, in such a way that each base learner tries to correct the mistakes of the previous base learner. It also follows a deterministic aggregation approach except that the individual predictions are weighted according to the accuracy of the base learners.
- Stacking (short for stacked generalization), in which heterogeneous base learners are trained (often in parallel) on the same training data, but their predictions are not aggregated using a deterministic approach. Instead, a separate model, known as blender or meta learner is used for aggregation that takes the predictions from the base learners as inputs and generates the final prediction as its output.
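Again as a rough sketch (assuming scikit-learn, whose off-the-shelf classes and default settings stand in for the general definitions above), the three flavours look like this in code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: identical trees trained in parallel on bootstrap samples,
# aggregated by majority voting
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)

# Boosting: identical weak learners (decision stumps by default) trained
# sequentially, each focusing on the previous learners' mistakes
boosting = AdaBoostClassifier(n_estimators=50)

# Stacking: different base learners whose predictions are fed to a separate
# meta learner (here, a logistic regression) that produces the final output
stacking = StackingClassifier(
    estimators=[("svm", SVC()), ("tree", DecisionTreeClassifier(max_depth=3))],
    final_estimator=LogisticRegression(),
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, model.fit(X, y).score(X, y))
```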
What is a Weak Learner?
A weak learner is any model or algorithm that performs just slightly better than random guessing.
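For instance, a decision stump (a decision tree that is allowed to ask only one question) is a classic weak learner. A minimal sketch, assuming scikit-learn and a synthetic dataset chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A synthetic binary classification problem (roughly balanced classes,
# so random guessing would score around 0.5)
X, y = make_classification(n_samples=1000, n_informative=10, random_state=7)

# A decision stump: a tree of depth 1, i.e. a single if/else split
stump = DecisionTreeClassifier(max_depth=1)

# Its cross-validated accuracy is typically better than chance,
# but well short of what a less restricted model can achieve
print(cross_val_score(stump, X, y, cv=5).mean())
print(cross_val_score(DecisionTreeClassifier(max_depth=None), X, y, cv=5).mean())
```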
For example, let's say Santiago wants to try his luck betting on horse races, and to maximize his winnings he needs some logic (or a rule of thumb). Sophisticated ML people call it a model. Anyway, assuming that he knows very little about horse racing, he can either bet on a random horse or use some common sense.
The following figure shows some of the options for him to consider.

In this example, the latter two options are slightly better than random guessing but are overly simplistic, and not very helpful on their own. That's why these are called weak learners.
The question that arises from here is: "Can a weak learning algorithm that performs just slightly better than random guessing be boosted into an arbitrarily accurate strong learning algorithm?" [2]
Kearns and Valiant [2] were the first to pose this question, and it presented researchers at the time with the following challenges:
- How to present the training data to the weak learners such that they generate the most useful rules of thumb (i.e., models).
- How to combine multiple rules of thumb (i.e., models) from different weak learners into a single, highly accurate prediction rule.
Since there can be multiple weak learners, one obvious choice is to use an ensemble method. But which one? A good answer would be: the one that addresses the challenges we just discussed.
Boosting solves the above-mentioned challenges by combining low-accuracy rules of thumb to generate a highly accurate prediction rule. It is achieved by training the weak learners sequentially such that each weak learner tries to improve upon the mistakes of its predecessor.
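The sketch below illustrates this sequential idea in a deliberately simplified way; the re-weighting rule is a crude placeholder, not the actual AdaBoost update, which we introduce next:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)

# Start by paying equal attention to every training instance
weights = np.full(len(y), 1 / len(y))

learners = []
for _ in range(5):
    # Fit a weak learner (a decision stump) using the current weights
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)

    # Find the instances it got wrong and increase their weights so that the
    # next learner focuses on them (a crude placeholder rule, not AdaBoost's)
    mistakes = stump.predict(X) != y
    weights[mistakes] *= 2.0
    weights /= weights.sum()  # keep the weights a valid distribution

    learners.append(stump)
```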
Continued research in this direction eventually led to the introduction of the AdaBoost algorithm by Freund and Schapire in 1995 [2], which is what we are going to discuss next.
AdaBoost
AdaBoost, short for Adaptive Boosting, is one of the earliest and most popular boosting methods. The main idea is to pay a bit more attention to the training instances that the predecessor underfitted [3]. This results in new predictors focusing more and more on the hard cases.
The first question that popped up in my head after reading this definition was: how does a weak learner know which training instances it should pay more attention to? And how do we even quantify "attention" so that it can be assigned to the training instances?
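Before we unpack those questions, here is, purely as a preview, a minimal sketch of what using AdaBoost looks like with scikit-learn's AdaBoostClassifier (which boosts decision stumps by default); the instance-weighting machinery it hides is exactly what we are about to open up:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# 50 sequentially trained weak learners; internally, the training instances
# are re-weighted after every round so that later learners concentrate on
# the instances earlier ones misclassified
clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```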