Boosting Algorithms in Machine Learning, Part II: Gradient Boosting


In this article, we will learn about gradient boosting, a machine learning algorithm that lays the foundation of popular frameworks like XGBoost and LightGBM which are award-winning solutions for several machine learning competitions.

This is a great read for anyone thinking about using ensembles in machine learning, and a refresher for those who are already pros but want to take a break from dot-fits and dot-predicts and have a look under the hood!

We'll cover the basics of ensemble learning and explain, with a step-by-step example, how the Gradient Boosting algorithm makes predictions. We'll also explore whether there is any connection between gradient descent and gradient boosting. Let's get started!

Ensemble Learning

Ensemble learning is the process of training and combining multiple models, often weak learners, to create a strong learner with a higher predictive power. Two ways to do this are bagging and boosting.

1. Bagging

Bagging, short for bootstrap aggregation, consists of training multiple weak learners independently on different subsets of the training data sampled with replacement i.e., bootstrapping. The final prediction is then obtained by averaging each model's individual predictions for regression or taking a majority vote for classification. Random forests, for example, use bagging to train multiple decision trees on different subsets of the data.
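For instance, here is a minimal bagging sketch using scikit-learn's BaggingRegressor on a synthetic dataset (the data and hyperparameters are illustrative only, not tied to any example in this article):

```python
# A minimal bagging sketch: many decision trees, each trained on a bootstrap
# sample of the data, with their predictions averaged at the end.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# By default, BaggingRegressor uses decision trees as its base learners.
bagging = BaggingRegressor(
    n_estimators=100,
    bootstrap=True,   # sample the training data with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)
print("Test R^2:", bagging.score(X_test, y_test))
```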

Bagging reduces variance, thus making the final ensemble less prone to overfitting. If you're curious to learn more about how bagging helps prevent overfitting in decision trees, check this article out.

2. Boosting

Boosting involves training multiple models sequentially such that each model learns from its predecessor's mistakes, so it (hopefully) doesn't repeat the same mistakes.

Boosting focuses on reducing bias rather than variance and builds the final model by "boosting" weak learners iteratively.

Boosting began with the idea of focusing on hard-to-predict samples in the training dataset, which is the main idea of AdaBoost. AdaBoost adjusts the weights of samples based on whether or not they were misclassified by the preceding model. The samples with adjusted weights are then passed to the next model, thus reducing the overall error. AdaBoost is sensitive to noisy data because assigning higher weights to outliers risks overfitting them. I have explained AdaBoost in detail in this article with the help of an example in Python.
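As a quick illustration of that reweighting idea, here is a minimal AdaBoost sketch with scikit-learn on synthetic data (not the example from the linked article):

```python
# A minimal AdaBoost sketch: each new weak learner (a decision stump by default)
# pays more attention to the samples the previous learners misclassified.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
```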

Gradient Boosting was introduced to overcome the limitations of AdaBoost. Instead of re-weighting the samples, gradient boosting focuses on the residual errors of the current model, which correspond to the negative gradients of the loss with respect to its predictions. Each new weak learner (usually a decision tree) is trained to minimize these residuals, improving the overall model accuracy. Let's dig more into the details of how it works.

Gradient Boosting

Gradient Boosting is a machine learning technique that sequentially builds a strong ensemble by combining multiple weak learners, typically decision trees. It does so by fitting each new weak learner to the residual errors (i.e., the differences between actual and predicted values) left by the model built so far.

In gradient boosting, each model corrects the prior model's errors to minimize a loss function such as mean squared error for regression or log loss for classification. The predictions from each model are scaled by a learning rate and combined to create an ensemble.

Let's look at the steps a gradient boosting model follows, shown below:

Process of building a gradient boosting model (Image created by author using napkin.ai)

Assume that we are predicting house prices based on a bunch of features such as size, number of rooms, locality, etc. and we have just three samples in our training data as shown below where actual prices are our target variable.

Example of house price prediction, consider three samples in the training set (Image by author)
1. Initialize Predictions: Before training any model, we start with a baseline prediction for all samples, which is usually the average of our target variable across the training set. In our example of three houses, the average price is $226,000, which is used to initialize the predictions.

Step 1: Initializing baseline predictions using the average value over the training data (Image by author)
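In code, step 1 looks something like the sketch below. The three prices are hypothetical placeholders, chosen only so that their average works out to $226K; the actual values live in the figure above.

```python
import numpy as np

# Hypothetical prices for the three houses (the real ones are in the figure);
# they are chosen so that the mean works out to $226,000.
actual_prices = np.array([171_000, 201_000, 306_000], dtype=float)

# Step 1: the baseline prediction for every sample is the mean of the target.
baseline = actual_prices.mean()
predictions = np.full_like(actual_prices, baseline)
print(predictions)  # [226000. 226000. 226000.]
```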

2. Calculate Residuals: After getting a prediction for each of our samples, the next step is to compute the residuals, i.e., the differences between the actual values and the predictions. These residuals represent the error of our baseline predictions.

Step 2: Computing residuals (Image by author)
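Continuing the sketch from step 1 (still with the hypothetical prices):

```python
# Step 2: residuals are the gaps between the actual prices and the current predictions.
residuals = actual_prices - predictions
print(residuals)  # [-55000. -25000.  80000.] for the hypothetical prices
```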

3. Train Weak Learner: In gradient boosting, after calculating the residuals from the initial baseline predictions, the next step is to train a weak learner (such as a simple decision tree) on these residuals rather than on the actual target values (i.e., the house prices). So the input to the decision tree will be the house features, and the target will be the residuals of the previous step's predictions (for the first model, the residuals from the baseline predictions are used).

Step 3: The weak learner is trained to learn the residuals from the input features (Image by author)

Let's assume that the weak learner gives the predictions -$50K, -$20K, and $80K for each house respectively. Note that these predictions don't match the residuals exactly but aim to reduce the overall error; how they do so will become clear in just a second.

Step 3: The weak learner predictions on the residuals are used as error correction to update the previous iteration's predictions (Image by author)

The weak learner's prediction on the residuals serves as our Error Correction – it represents the adjustments needed to reduce the current errors in the next step.
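Continuing the sketch, a shallow DecisionTreeRegressor can play the role of the weak learner. The feature matrix below is a hypothetical stand-in for size and number of rooms, and this toy tree won't reproduce the article's -$50K/-$20K/$80K corrections exactly; it just shows the mechanics of step 3.

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical features for the three houses: [size in sqft, number of rooms].
X = np.array([[1200, 2],
              [1500, 3],
              [2100, 4]])

# Step 3: the weak learner is fit on the residuals, not on the prices themselves.
weak_learner = DecisionTreeRegressor(max_depth=1)
weak_learner.fit(X, residuals)

# Its predictions on the residuals act as the error correction for step 4.
error_correction = weak_learner.predict(X)
```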

4. Update Predictions: After the weak learner is trained to predict the residuals, we update the baseline predictions by adding a scaled version of the weak learner's predictions, i.e., the error correction. This scaling factor, called the learning rate (denoted by α), is a crucial hyperparameter that controls how much each weak learner contributes to the overall model. In this case, let's assume a learning rate of 0.1. The general update rule for each iteration is:

ŷ_new = ŷ_previous + α × (error correction)

Here ŷ (y_hat) represents the predictions and α is the learning rate. Using this rule, let's update the predictions for each of the given houses:

Step 4: Illustration of how a current prediction is updated using the weak learner's estimate of the residuals scaled by a learning rate (Image by author)

We notice that each house's prediction has been adjusted slightly closer to the actual prices, thanks to the scaled correction from the weak learner. This illustrates the important role of the learning rate in controlling the magnitude of each adjustment.
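In code, the step 4 update looks like this, plugging in the article's -$50K, -$20K, and $80K corrections and a learning rate of 0.1:

```python
learning_rate = 0.1

# Weak learner's predictions on the residuals, taken from the example above.
error_correction = np.array([-50_000, -20_000, 80_000], dtype=float)

# Step 4: new prediction = previous prediction + learning_rate * error correction.
predictions = predictions + learning_rate * error_correction
print(predictions)  # [221000. 224000. 234000.]
```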

How does the learning rate affect the predictions? A higher learning rate would result in larger updates, which could lead to "overshooting" the actual values and cause the model to fluctuate rather than converge smoothly. On the other hand, a very low learning rate would make only small adjustments, requiring more iterations to reach accurate predictions and potentially slowing down the learning process.
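One rough way to see this trade-off is to compare a few learning rates on synthetic data (a sketch, not the article's experiment; exact scores will vary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_syn, y_syn = make_regression(n_samples=1000, n_features=10, noise=15.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, random_state=0)

# With a fixed budget of 100 trees, a very large learning rate can overshoot,
# while a tiny one may not have enough iterations to catch up.
for lr in (1.0, 0.1, 0.01):
    model = GradientBoostingRegressor(n_estimators=100, learning_rate=lr, random_state=0)
    model.fit(X_tr, y_tr)
    print(f"learning_rate={lr}: validation R^2 = {model.score(X_val, y_val):.3f}")
```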

5. Repeat the process: After updating predictions with the first weak learner, we repeat steps 2 through 4 over multiple iterations: recalculating the residuals, training another weak learner on the new residuals, and updating the previous iteration's predictions using the update rule above. Each iteration refines the model further, bringing the predictions closer to the actual values.
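Putting steps 1 through 4 into a loop gives a bare-bones, from-scratch version of the whole procedure. This is a simplified sketch for squared-error regression (the function names and defaults are mine, not scikit-learn's):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_estimators=5, learning_rate=0.1, max_depth=1):
    """Bare-bones gradient boosting for squared-error regression."""
    baseline = y.mean()                        # Step 1: baseline prediction
    predictions = np.full(len(y), baseline)
    learners = []
    for _ in range(n_estimators):              # Step 5: repeat for a fixed number of rounds
        residuals = y - predictions            # Step 2: residuals of the current model
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                 # Step 3: weak learner trained on residuals
        predictions = predictions + learning_rate * tree.predict(X)  # Step 4: scaled update
        learners.append(tree)
    return baseline, learners

def predict_gradient_boosting(X_new, baseline, learners, learning_rate=0.1):
    """Start from the baseline and add each tree's scaled correction in order."""
    predictions = np.full(X_new.shape[0], baseline)
    for tree in learners:
        predictions = predictions + learning_rate * tree.predict(X_new)
    return predictions
```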

While repeating the process is essential for refining the model, it doesn't mean continuing forever! We need to stop at some point, and this decision can be based on several factors:

  • Fixed Number of Iterations/Estimators: Often, we set a predefined number of iterations (estimators) as a hyperparameter, determining in advance how many rounds of boosting will take place. Each iteration trains one new weak learner that focuses on correcting the residual errors left by the previous learners.
  • Error Threshold: Training can also stop once the residuals or overall error reach a sufficiently small value, indicating that the predictions are very close to the actual values.
  • No Significant Improvement: Another common stopping point is when the model's performance on a validation set stops improving. This helps to avoid overfitting by ending the training process before the model becomes too complex.

Inference time! What happens when we want to predict the price of a brand-new house?

Let's say we've created 5 weak learners in our model (in fact, this number is a hyperparameter, n_estimators, that we set when using GradientBoostingRegressor in scikit-learn). When a new house comes in and we don't know its price, our gradient boosting model begins with a baseline prediction, usually the average price from the training data.

Next, we pass this new house through each of our trained weak learners in the order they were built. Each learner has learned something unique about the patterns in the data, so one by one, they add their own small correction to the prediction, and that's how we can arrive at a final estimate of the predicted price!

Following is an illustration of what we just discussed.

Illustration of how a gradient boosting model makes predictions, the data sample is passed to the sequence of trained weak learners to obtain the final prediction (Image by author)
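The same workflow with scikit-learn looks roughly like the sketch below. The training features, prices, and the new house's features are hypothetical placeholders, not the article's actual data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training data: [size in sqft, number of rooms] and prices
# (placeholders with a mean price of $226K, standing in for the figure's data).
X_train = np.array([[1200, 2], [1500, 3], [2100, 4]])
y_train = np.array([171_000, 201_000, 306_000])

# Five weak learners, each contribution scaled by the learning rate.
model = GradientBoostingRegressor(n_estimators=5, learning_rate=0.1, max_depth=1)
model.fit(X_train, y_train)

# A brand-new house: the prediction starts at the training mean and is then
# adjusted, in order, by each of the five trees.
new_house = np.array([[1800, 3]])
print(model.predict(new_house))
```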

Is there any relation to Gradient Descent?

One might wonder whether there is any relationship at all between gradient descent and gradient boosting. There is: gradient boosting can be viewed as gradient descent in function space. For squared-error loss, the residuals are exactly the negative gradients of the loss with respect to the current predictions, so each new weak learner takes a step in the direction that most reduces the loss, scaled by the learning rate.
