Linear Regression In Depth (Part 2)

In the first part of this article we formally defined the Linear Regression problem and showed how to solve simple linear regression problems, where the data set contains only one feature. In the second part of the article, we will discuss multiple linear regression problems, where the data set may contain any number of features.
We will first generalize the closed-form solution we have found for simple linear regression to any number of features. Then we will suggest an alternative approach for solving linear regression problems that is based on gradient descent, and discuss the pros and cons of this approach vs. using the closed-form solution. In addition, we will explore the classes in Scikit-Learn that implement both approaches, and demonstrate how to use them on a real-world data set.
Multiple Linear Regression Definition
Recall that in regression problems we are given a set of n labeled examples: D = {(x₁, y₁), (x₂, y₂), … , (xₙ, yₙ)}, where xᵢ is an m-dimensional vector containing the features of example i, and yᵢ is a real value that represents the label of that example.
In linear regression problems, we assume that there is a linear relationship between the feature vector x and the label y, so our model hypothesis takes the following form:
ŷ = w₀ + w₁x₁ + w₂x₂ + … + wₘxₘ
Our goal is to find the parameters w of this model that will minimize the sum of squared residuals:
J(w) = Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)²
In the previous part of the article, we showed how to find the optimal w for the case of m = 1 using the normal equations. We will now extend these equations to any number of features m.
To simplify the derivation of the normal equations for the general case, we first define a matrix X that contains the values of all the features in the data set, including the intercept terms:
X = ⎡ 1  x₁₁  x₁₂  …  x₁ₘ ⎤
    ⎢ 1  x₂₁  x₂₂  …  x₂ₘ ⎥
    ⎢ ⋮    ⋮    ⋮    ⋱   ⋮  ⎥
    ⎣ 1  xₙ₁  xₙ₂  …  xₙₘ ⎦
This matrix is called the design matrix. Each row in the design matrix represents an individual sample, and the columns represent the explanatory variables. The dimensions of the matrix are n × (m + 1), where n is the number of samples and m is the number of features.
In addition, we define the vector y as an n-dimensional vector that contains all the target values:
y = (y₁, y₂, …, yₙ)ᵗ
These definitions allow us to write the least squares cost function in the following matrix form:
J(w) = (Xw − y)ᵗ(Xw − y)
Proof:
We first note that:
Xw − y = (ŷ₁ − y₁, ŷ₂ − y₂, …, ŷₙ − yₙ)ᵗ
The dot product of a vector with itself, uᵗu, is simply the sum of its squared components, therefore we have:
J(w) = Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² = (Xw − y)ᵗ(Xw − y)
Closed-Form Solution
As was the case in simple linear regression, the function J(w) is convex, hence it has a single local minimum, which is also the global minimum. In order to find this global minimum, we need to compute the gradient of J(w) with respect to w and set it to zero.
The gradient of J(w) with respect to w is:
∇J(w) = 2Xᵗ(Xw − y)
In this derivation we used some basic rules of matrix calculus, which are explained in this article.
We now set this gradient to zero in order to obtain the normal equations:
XᵗXw = Xᵗy
Therefore, the optimal w* that minimizes the least squares cost function is:
w* = (XᵗX)⁻¹Xᵗy
Note that we assumed here that the columns of X are linearly independent (i.e., X has full column rank), otherwise XᵗX is not invertible and there is no unique solution for w*.
When the columns of X are linearly dependent, we call this phenomenon multicollinearity. Mathematically, a set of variables is perfectly multicollinear if for all samples i:
λ₀ + λ₁xᵢ₁ + λ₂xᵢ₂ + … + λₘxᵢₘ = 0
where λₖ are constants, and xᵢₖ is the value of feature k in sample i.
In practice, perfect multicollinearity is rare (e.g., it can be caused by accidentally duplicating a variable in the data). However, even lesser degrees of multicollinearity, where two or more features are highly correlated with each other (but not perfectly correlated), can cause issues both when fitting the model (the coefficients become very sensitive to small changes in the data) and when interpreting the results (it becomes hard to identify which features have the most impact on the model's predictions).
The interested reader can find more info about the multicollinearity problem and how to handle it in this Wikipedia article.
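As a small illustration (a toy sketch, not part of the example that follows), duplicating a feature column makes the design matrix rank-deficient, so XᵗX cannot be inverted:
import numpy as np

X = np.column_stack((np.ones(5), np.random.rand(5)))  # intercept column + one feature
X_dup = np.column_stack((X, X[:, 1]))                 # duplicate the feature -> perfect multicollinearity

print(np.linalg.matrix_rank(X_dup))                   # prints 2, although the matrix has 3 columns
try:
    np.linalg.inv(X_dup.T @ X_dup)
except np.linalg.LinAlgError as err:
    print('XᵗX is singular:', err)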
Multiple Linear Regression Example
To demonstrate the usage of the closed-form solution, let's build a linear regression model for the California housing data set, available from the sklearn.datasets module. The goal in this data set is to predict the median house value of a given district (house block) in California, based on 8 different features of that district, such as the median income or the average number of rooms per household.
We first import the required libraries, and initialize the random seed in order to have reproducible results:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)
Then, we fetch the data set:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names
To explore the data set, we merge the features (X) and the labels (y) into a Pandas DataFrame, and display the first rows from the table:
mat = np.column_stack((X, y))
df = pd.DataFrame(mat, columns=np.append(feature_names, 'MedianValue'))
df.head()

We can further investigate the data set by calling the DataFrame's info() method, which provides information about the type of the columns and whether they contain any missing values:
df.info()

Luckily, this data set contains only numerical features and has no missing values. Therefore, no data preprocessing is required here (the closed-form solution does not require normalization of the data).
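If you want to verify this yourself, a quick check (an optional extra step) could look like this:
print(df.isna().sum())   # number of missing values per column (all zeros here)
print(df.dtypes)         # all columns are numeric (float64)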
Next, we need to append a column of 1s to the matrix X in order to represent the intercept terms. This can be easily done with the function np.column_stack():
X_with_bias = np.column_stack((np.ones(len(X)), X))
We now split the data into 80% training set and 20% test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_with_bias, y, test_size=0.2, random_state=0)
Let's now write a general function to find the optimal w* for any given data set, using the closed-form solution we have found earlier:
def closed_form_solution(X, y):
    # w* = (XᵗX)⁻¹Xᵗy
    w = np.linalg.inv(X.T @ X) @ X.T @ y
    return w
The closed-form solution can be implemented in a single line of code!
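As a side note, explicitly inverting XᵗX works fine here, but for larger or ill-conditioned problems a numerically safer variant (an alternative sketch, not the function used below) is to call a least-squares solver instead:
def closed_form_solution_lstsq(X, y):
    # Solves min ||Xw - y||² without forming the matrix inverse explicitly
    w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return w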
Let's use this function to find the optimal w* for our training set:
w = closed_form_solution(X_train, y_train)
print(w)
The optimal w* is:
[-3.68585691e+01 4.33333407e-01 9.29324337e-03 -9.86433739e-02
5.93215487e-01 -7.56192502e-06 -4.74516383e-03 -4.21449336e-01
-4.34166041e-01]
The first component in this vector is the intercept (w₀), and the rest are the coefficients of the eight features in the data set.
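To make these numbers easier to read, we can (as an optional extra step) pair each coefficient with its feature name:
coefficients = pd.Series(w[1:], index=feature_names)  # skip w[0], which is the intercept
print(coefficients)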
Let's now evaluate the model on the training and the test sets. It is important to evaluate your model on both of them, because a large discrepancy between the training and the test scores may indicate that your model is overfitting.
We start by finding the R² score on the training set. To that end, we first get the predictions of the model on the training examples by multiplying the matrix X_train by the vector w:
y_train_pred = X_train @ w
We now use the r2_score() function from sklearn.metrics to find the R² score on the training set:
from sklearn.metrics import r2_score
train_score = r2_score(y_train, y_train_pred)
print(f'R2 score (train): {train_score:.4f}')
The score we get is:
R2 score (train): 0.6089
Let's do the same on the test set:
y_test_pred = X_test @ w
test_score = r2_score(y_test, y_test_pred)
print(f'R2 score (test): {test_score:.4f}')
The score we get is:
R2 score (test): 0.5943
The score is not high, which indicates that the relationship between the features and the label might not be linear. In such cases, non-linear regression models, such as regression trees or k-nearest neighbors, can provide better results.
Exercise
Let's say we accidentally duplicated every point in the data set and then ran linear regression again. How will this affect the weights of the model? Hint: Think what will happen to the design matrix X and the labels vector y, and how these changes will affect the normal equations.
The solution can be found at the end of this article.
Linear Regression in Scikit-Learn
Scikit-Learn provides a class named LinearRegression that also implements the closed-form solution to the ordinary least squares problem.
By default, this class automatically adds a column of 1s to the design matrix, so you do not need to add it manually as we did earlier (unless you set the parameter fit_intercept to False in the constructor). Therefore, we need to re-split the original data set (without the extra column of ones) into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Let's create an instance of the LinearRegression class and fit it to the training set:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)
The fitted parameter vector w is stored in two attributes of this class:
- coef_ is an array that contains all the weights except for the intercept
- intercept_ is the intercept term (w₀)
Let's print them:
print(reg.intercept_)
print(reg.coef_)
The output is:
-36.858569106801234
[ 4.33333407e-01 9.29324337e-03 -9.86433739e-02 5.93215487e-01
-7.56192502e-06 -4.74516383e-03 -4.21449336e-01 -4.34166041e-01]
We get exactly the same coefficients as we did with the computation in NumPy.
The score() method of the LinearRegression class returns the R² score of the model. It only requires the matrix X and the vector y of the data set you want to get the score on (so no need to compute the model's predictions). For example, we can get the R² score on the training and the test sets as follows:
train_score = reg.score(X_train, y_train)
print(f'R2 score (train): {train_score:.4f}')
test_score = reg.score(X_test, y_test)
print(f'R2 score (test): {test_score:.4f}')
R2 score (train): 0.6089
R2 score (test): 0.5943
As expected, we get the same R² scores as before.
Analyzing the Regression Errors
In addition to evaluating the overall performance of the model, we often want to investigate the behavior of our regression errors. For example, are the errors normally distributed around 0 or are they skewed? Are there any inputs for which our model has particularly high prediction errors?
Answers to these questions will help us find the source of these errors. For example, if the errors are not normally distributed around 0, this might indicate that the linear regression model is not suitable for our data set and we need to try other regression models (e.g., polynomial regression). Or, if our model has particularly high prediction errors on some samples, they might be outliers and we need to investigate where they come from.
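For example, a quick way to check whether the errors are roughly symmetric around 0 is to plot a histogram of the residuals (a short sketch using the training predictions computed earlier):
residuals = y_train_pred - y_train

plt.hist(residuals, bins=50)
plt.xlabel('Residual')
plt.ylabel('Count')
plt.show()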
A plot that can help you answer these questions is called a residuals plot. This plot shows the residuals on the y-axis vs. the predicted values of the model on the x-axis.
Let's write a function to create this plot:
def plot_residuals(y_train_pred, y_train, y_test_pred, y_test):
    # Residual = predicted value minus true value
    plt.scatter(y_train_pred, y_train_pred - y_train, s=2, marker='o', c='b', label='Training')
    plt.scatter(y_test_pred, y_test_pred - y_test, s=2, marker='s', c='m', label='Test')
    xmin = min(y_train_pred.min(), y_test_pred.min())
    xmax = max(y_train_pred.max(), y_test_pred.max())
    plt.hlines(y=0, xmin=xmin, xmax=xmax, color='black')
    plt.xlim(xmin, xmax)
    plt.xlabel('Predicted values')
    plt.ylabel('Residuals')
    plt.legend()
We can now call this function to show the residuals both on the training and the test sets:
plot_residuals(y_train_pred, y_train, y_test_pred, y_test)

We can see that most of the errors are symmetrically distributed around 0, but there are some outliers on the far ends of the input range, which may require further investigation.
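If we want to look into these outliers, one possible follow-up (a small sketch) is to list the test samples with the largest absolute residuals:
test_residuals = y_test_pred - y_test
worst = np.argsort(np.abs(test_residuals))[-5:]   # indices of the 5 largest absolute errors

for i in worst:
    print(f'predicted: {y_test_pred[i]:.2f}, actual: {y_test[i]:.2f}')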
Exercise
Download the Students Marks dataset from Kaggle. Build a linear regression model to predict a student's mark based on their study time and the number of courses they took. Compute the RMSE and R² score of the model both on the training and the test sets. Plot the residuals vs. the predicted values. What can you learn from this plot about the data set?
Gradient Descent
Although the closed-form solution gives us a direct way to find the optimal parameters of the regression model, it suffers from a few drawbacks:
- The closed-form solution is computationally inefficient when we have a large number of features, since it requires computing the inverse of XᵗX, which is an (m + 1) × (m + 1) matrix (m is the number of features). Computing the inverse of a matrix has a runtime complexity of O(m³) under most implementations.
- It requires keeping the entire design matrix X in memory, which is not always feasible if we have a very large data set.
- It does not support online (incremental) learning, since any change to the design matrix X requires re-computation of the inverse of XᵗX.
Fortunately, there is an alternative approach for finding the optimal w: gradient descent. Gradient descent is an iterative method for finding a minimum of a function, in which we take small steps in the opposite direction of the gradient in order to get closer to the minimum:

In order to use gradient descent to find the minimum of the least squares cost, we need to compute the partial derivatives of J(w) with respect to each one of the weights.
The partial derivative of J(w) with respect to any of the weights wⱼ is:
∂J(w)/∂wⱼ = 2Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)xᵢⱼ
Therefore, the gradient descent update rule is:
wⱼ ← wⱼ − α ∂J(w)/∂wⱼ = wⱼ − 2α Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)xᵢⱼ
where α is a learning rate that controls the step size (0 < α < 1).
Instead of updating each component of w separately, we can update the entire vector w in one step:
w ← w − α∇J(w) = w − 2αXᵗ(Xw − y)
Gradient descent can be applied in one of the following modes:
- Batch gradient descent – the weights are updated after we compute the error on the entire training set.
- Stochastic gradient descent (SGD) – a gradient descent step is performed after every training example. In this case, the gradient descent update rule takes the following form:
w ← w − 2α(ŷᵢ − yᵢ)xᵢ
SGD typically converges faster than batch gradient descent as it makes progress after each example, and it also supports online learning since it can process new data points one at a time. On the other hand, SGD is less stable than batch gradient descent, and its convergence to the global optimum is not guaranteed (although in practice it gets very close to the optimum).
Note that whenever you use gradient descent, you must make sure that your data set is normalized (otherwise gradient descent may take steps of different sizes in different directions, which will make it unstable).
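To make the procedure concrete, here is a minimal sketch of batch gradient descent for linear regression (an illustrative toy implementation, not the code used later in this article; it assumes X already contains the intercept column and that the features have been standardized, and it averages the gradient over the samples so the step size does not depend on the data set size):
def batch_gradient_descent(X, y, alpha=0.01, n_iter=1000):
    w = np.zeros(X.shape[1])                      # start from an all-zeros weight vector
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)     # averaged gradient of the squared residuals
        w -= alpha * grad
    return w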
The SGDRegressor Class
The class SGDRegressor in Scikit-Learn implements the SGD approach for fitting a linear regression model.
The important hyperparameters of this class are:
- loss – the loss function used as the objective of the optimization. The options for this parameter are: ‘squared_error' (squared loss, this is the default option), ‘huber' (Huber loss) and ‘epsilon_insensitive' (the loss function used in Support Vector Regression). The difference between these loss functions is explained in this article.
- penalty – the type of regularization to use (defaults to ‘l2').
- alpha – the regularization coefficient (defaults to 0.0001).
- max_iter – the maximum number of epochs over the training data (defaults to 1000).
- learning_rate – the learning rate schedule for the weight updates (defaults to ‘invscaling').
- eta0 – the initial learning rate used (defaults to 0.01).
- early_stopping – whether to stop the training when the validation score is not improving (defaults to False).
- validation_fraction – the proportion of the training set to set aside for validation (defaults to 0.1).
Since we need to normalize our data before using SGD, we will create a pipeline that consists of two steps:
- A StandardScaler that normalizes the features by removing the mean and scaling them to unit variance.
- An SGDRegressor with its default settings.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('reg', SGDRegressor())
])
Let's fit the pipeline to our training set:
pipeline.fit(X_train, y_train)
And now let's evaluate the model on the training and test sets:
train_score = pipeline.score(X_train, y_train)
print(f'R2 score (train): {train_score:.4f}')
test_score = pipeline.score(X_test, y_test)
print(f'R2 score (test): {test_score:.4f}')
The scores we get are:
R2 score (train): -511.0496
R2 score (test): -4735.6157
These are very bad scores! What has just happened?
When you get such bad scores with gradient descent, it usually means that your learning rate is too high, which causes the algorithm to oscillate between the two sides of the minimum:

Let's reduce the learning rate from 0.01 to 0.001 by changing the parameter eta0 of the SGDRegressor:
pipeline.set_params(reg__eta0=0.001)
Let's refit the pipeline to the training set and reevaluate it:
pipeline.fit(X_train, y_train)
train_score = pipeline.score(X_train, y_train)
print(f'R2 score (train): {train_score:.4f}')
test_score = pipeline.score(X_test, y_test)
print(f'R2 score (test): {test_score:.4f}')
The scores we get this time are:
R2 score (train): 0.6018
R2 score (test): 0.5841
which are similar to the R² scores we obtained with the closed-form solution (they are a bit lower, since SGD gets close to the global minimum, but not to the minimum itself).
Key Takeaways
- In linear regression we are trying to find a linear relationship between a set of features and a target variable.
- The key assumptions are that the features are linearly independent, and that the error terms are independent of each other and normally distributed with a zero mean.
- In ordinary least squares (OLS) regression we try to minimize the sum of squared residuals. The cost function of OLS is convex, so it has a single minimum point.
- The two main approaches for finding the optimal parameters of the model are the closed-form solution and gradient descent. Gradient descent is favored when the data set is large, or when we need to support online learning.
- When using gradient descent, it is important to normalize your features and choose an appropriate learning rate.
- We evaluate the performance of regression models using the R² score, which measures how much better the model is than a baseline model that always predicts the mean of the target (its maximum value is 1, and it can be negative for very poor models), and the RMSE, which is the square root of the mean of the squared errors (see the sketch below).
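The sketch below computes both metrics directly with NumPy (Scikit-Learn also provides r2_score and mean_squared_error for the same purpose):
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot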
Solution to the Duplicate Data Exercise
Recall that the closed-form solution is:
w* = (XᵗX)⁻¹Xᵗy
If we double the data points, then instead of X and y, we will have the following design matrix and target vector:
A = ⎡ X ⎤        z = ⎡ y ⎤
    ⎣ X ⎦            ⎣ y ⎦
The matrix A has 2n rows and m columns, where rows 1, …, n of the matrix are identical to rows n + 1, …, 2n. Similarly, the vector z has 2n rows, where the first n rows are identical to the last n rows.
We can easily show that
AᵗA = 2XᵗX
Proof:
AᵗA = XᵗX + XᵗX = 2XᵗX (since Aᵗ consists of two copies of Xᵗ placed side by side)
Similarly, we have (we can treat z as a matrix with one column):
Aᵗz = Xᵗy + Xᵗy = 2Xᵗy
Therefore, we can write:
w* = (AᵗA)⁻¹Aᵗz = (2XᵗX)⁻¹(2Xᵗy) = (XᵗX)⁻¹Xᵗy
We get the same weights as in the original model! That is, the regression model will not change.
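We can also verify this numerically (a quick sanity check, reusing the closed_form_solution() function defined earlier):
w_original = closed_form_solution(X_train, y_train)
w_doubled = closed_form_solution(np.vstack((X_train, X_train)),
                                 np.concatenate((y_train, y_train)))
print(np.allclose(w_original, w_doubled))   # expected output: True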
Final Notes
All images unless otherwise noted are by the author.
You can find the code examples of this article on my github: https://github.com/roiyeho/medium/tree/main/multiple_linear_regression
Thanks for reading!