Why is Feature Scaling Important in Machine Learning? Discussing 6 Feature Scaling Techniques

Many machine-learning algorithms need to have features on the same scale.

There are different feature scaling methods that we can choose from in various scenarios. They have different (technical) names, and the term Feature Scaling simply refers to any of those methods.

Topics
------
1. Feature scaling in different scenarios

   a. Feature scaling in PCA
   b. Feature scaling in k-means
   c. Feature scaling in KNN and SVM
   d. Feature scaling in linear models
   e. Feature scaling in neural networks
   f. Feature scaling in the convergence of algorithms
   g. Feature scaling in tree-based algorithms
   h. Feature scaling in LDA

2. Feature scaling methods

   a. Standardization
   b. Min-Max Scaling (Normalization)
   c. Robust Scaling
   d. Mean Normalization
   e. Maximum Absolute Scaling
   f. Vector Unit-Length Scaling

3. Feature scaling and distribution of data

4. Data leakage when feature scaling

5. Summary of feature scaling methods

Feature scaling in different scenarios

  • Feature scaling in PCA: In principal component analysis (PCA), the components are highly sensitive to the relative ranges of the original features if they are not measured on the same scale. PCA chooses the components that maximize the variance of the data. If that maximization is driven by the larger ranges of some features, those features tend to dominate the PCA process, and the components may not capture the true variance. To avoid this, we generally perform feature scaling before PCA (see the short sketch after this list). However, there are two exceptions. If there is no significant difference in scale between the features, for example, one feature ranges between 0 and 1 and another between 0 and 1.2, we do not need to perform feature scaling, although there is no harm in doing so! And if you perform PCA by decomposing the correlation matrix instead of the covariance matrix, you do not need feature scaling even when the features are not measured on the same scale.
  • Feature scaling in k-means clustering: One of the main assumptions in k-means clustering is that all features are measured on the same scale. If not, we should perform feature scaling. The k-means algorithm calculates the distance between data points. Features with a larger range may dominate those distance calculations, so the resulting clusters may not reflect the true structure of the data. To avoid this, we need to perform feature scaling before k-means. Scaling features will also improve the training speed of k-means models.
  • Feature scaling in KNN and SVM algorithms: In general, algorithms that calculate the distance between data points are strongly affected by the relative ranges of the features, and KNN and SVM are no exceptions. A feature with a larger range may contribute more simply because of that range, not because of its importance. We do not want the algorithm to be biased in that way. Therefore, we need to scale the features so that they contribute equally to the distance calculations.
  • Feature scaling in linear models: The parameter values of linear models such as linear regression depend heavily on the scale of the input features. Therefore, it is better to use features measured on the same scale. That will also improve the training speed of linear models.
  • Feature scaling in neural networks: We usually apply feature scaling methods to the input data. But it is also possible to apply feature scaling to the activation values of the hidden layers in a neural network! The scaled outputs then become the inputs to the next layer. This is called batch normalization, which helps mitigate the vanishing gradient and internal covariate shift problems and enhances the stability of the network during training. It also speeds up the training of neural network models.
  • Feature scaling in the convergence of algorithms: The learning rate is the main factor that decides how fast deep learning and machine learning algorithms converge, but feature scaling also has an effect on this! When the features are measured on the same scale, the loss surface is better conditioned, so gradient-based optimization takes a more direct path to the minimum and the algorithms converge faster.
  • Feature scaling in tree-based algorithms: Feature scaling is not necessary for tree-based algorithms, as they are not sensitive to the relative scale of features. Popular tree-based algorithms are: Decision Tree, Random Forest, AdaBoost, Gradient Boosting, XGBoost, LightGBM, and CatBoost.
  • Feature scaling in LDA: Linear discriminant analysis (LDA) is a linear dimensionality reduction technique that performs dimensionality reduction by maximizing the class separability of classification datasets, not by maximizing the variance of the data. Therefore, LDA is not sensitive to the relative ranges of the features and feature scaling is not necessary for LDA.
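
To make the PCA point above concrete, here is a minimal sketch (it uses the iris dataset, which also appears later in this article) comparing the explained variance ratios of PCA with and without standardization. The exact numbers will depend on your data.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# PCA on the raw features: features with larger ranges dominate the components
pca_raw = PCA(n_components=2).fit(X)
print(pca_raw.explained_variance_ratio_)

# PCA on standardized features: each feature contributes on an equal footing
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)
print(pca_scaled.explained_variance_ratio_)

The same idea applies to the other scale-sensitive algorithms in the list: scaling the features changes which features drive the result.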

Feature scaling methods

1. Standardization

Standardization is the most popular feature scaling method, in which we rescale values so that they are centered at 0 with a unit standard deviation. The result is the z-score, and therefore this scaling method is also known as z-score standardization or z-score normalization. After applying standardization to a feature, the data has a mean of 0 and a standard deviation of 1. If the original data follows a normal distribution, the standardized data follows the standard normal distribution.

To apply standardization to a variable, first, we need to calculate the mean and the standard deviation of that variable. Then, we subtract the mean from each value and divide the result by the standard deviation. For a set of features, these calculations are performed feature-wise simultaneously.

Standardization formula: z = (x - mean) / standard deviation (Image by author)

The z-score, i.e., the output of standardization, tells you how many standard deviations a value deviates from the mean!

The z-score values are not bounded to a certain range. Standardization is less affected by outliers than min-max scaling, but it is not immune to them, since both the mean and the standard deviation are influenced by extreme values.

Standardization is particularly useful when the data follows a Gaussian or normal distribution or the distribution is unknown.

In Scikit-learn, standardization can be performed using the StandardScaler() transformer.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(data)
scaled_data = sc.transform(data)

When calling the fit() method, the mean and standard deviation of each variable are calculated. If we need to apply the scaling process to data, we should also call the transform() method.
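
If you want to inspect what fit() has learned, the fitted scaler exposes the per-feature statistics as attributes. A small illustrative snippet (data is the same array used above):

# The learned parameters are stored as attributes after fitting
print(sc.mean_)   # per-feature means
print(sc.scale_)  # per-feature standard deviations used for scaling

# fit_transform() combines both steps in a single call
scaled_data = sc.fit_transform(data)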


2. Min-Max Scaling (Normalization)

The min-max scaling or normalization is the process of scaling data into a specific range of your choice. The most commonly used range is (0, 1).

To apply min-max scaling to a variable, first, we need to find the minimum and maximum values of that variable. Then, we subtract the minimum from each data value and divide the result by the range (the difference between the maximum and minimum). For a set of features, these calculations are performed feature-wise simultaneously.

Min-max scaling formula: x_scaled = (x - min) / (max - min) (Image by author)

The min-max scaling is particularly useful when the distribution of data is not known or does not follow the normal distribution.

The min-max scaling process is highly sensitive to outliers in the data.

In Scikit-learn, min-max scaling can be performed using the MinMaxScaler() transformer.

from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler(feature_range=(0,1))
sc.fit(data)
scaled_data = sc.transform(data)

When calling the fit() method, the minimum and maximum values of each variable are found. If we need to apply the scaling process to data, we should also call the transform() method.

The MinMaxScaler() transformer also lets you choose a range other than the default through the feature_range argument. The default is (0, 1).
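
For example, here is a short sketch of scaling into the (-1, 1) range instead. The data_min_ and data_max_ attributes hold the per-feature minimums and maximums learned by fit().

from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler(feature_range=(-1, 1))  # scale into (-1, 1) instead of (0, 1)
scaled_data = sc.fit_transform(data)

print(sc.data_min_)  # per-feature minimums learned from the data
print(sc.data_max_)  # per-feature maximums learned from the data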


3. Robust Scaling

Robust scaling is also known as quantile scaling, in which we scale data based on the 1st, 2nd, and 3rd quartiles. The 2nd quartile is the median, as you already know.

To apply robust scaling to a variable, first, we need to find the quartiles of that variable. Then, we subtract the median (the 2nd quartile, Q2) from each data value and divide the result by the IQR (the difference between the 3rd and 1st quartiles, Q3 - Q1). For a set of features, these calculations are performed feature-wise simultaneously.

Robust scaling formula: x_scaled = (x - Q2) / (Q3 - Q1) (Image by author)
Robust scaling alternative formula (Image by author)

Robust scaling is particularly useful when there are outliers in the data. This is because the median and the IQR are robust to outliers (hence the name, robust scaling!).

In Scikit-learn, robust scaling can be performed using the RobustScaler() transformer.

from sklearn.preprocessing import RobustScaler

sc = RobustScaler()
sc.fit(data)
scaled_data = sc.transform(data)

When calling the fit() method, the median and quartiles of each variable are found. If we need to apply the scaling process to data, we should also call the transform() method.
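
As a quick sanity check (assuming data is a NumPy array), the same result can be computed manually with NumPy by subtracting the median and dividing by the IQR.

import numpy as np

# Manual robust scaling: (x - median) / IQR, computed feature-wise
q1, q2, q3 = np.percentile(data, [25, 50, 75], axis=0)
manual_scaled = (data - q2) / (q3 - q1)

print(np.allclose(manual_scaled, scaled_data))  # True, up to floating-point error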


4. Mean Normalization

Mean normalization is another popular feature scaling technique in which we subtract the mean from each data value and divide the result by the range (the difference between the maximum and minimum).

The formula is quite similar to the min-max scaling formula, except we subtract the mean from each data value, instead of subtracting the minimum from each data value.

Mean normalization formula: x_scaled = (x - mean) / (max - min) (Image by author)

Mean normalization cannot be directly implemented in Scikit-learn as there is no dedicated transformer to do that. But we can create a Scikit-learn pipeline by combining the StandardScaler() and RobustScaler() transformers to perform mean normalization. A pipeline sequentially applies multiple transformers to data.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

ss = StandardScaler(with_mean=True, with_std=False)
rs = RobustScaler(with_centering=False, quantile_range=(0, 100))

mean_normalizer = Pipeline([('Std_Scaler', ss),
                            ('Rob_Scaler', rs)
])

mean_normalizer.fit(data)
scaled_data = mean_normalizer.transform(data)

In the above code sample, I have customized both StandardScaler() and RobustScaler() transformers by changing their default hyperparameter values!

ss = StandardScaler(with_mean=True, with_std=False)

This code line does not perform typical standardization. Instead, it only subtracts the mean from each value; because with_std=False, no division by the standard deviation is performed. The output of this will become the input to the next transformer.

rs = RobustScaler(with_centering=False, quantile_range=(0, 100))

By setting with_centering=False, it does not subtract the median from the input values. In addition, with quantile_range=(0, 100), it divides the input values by the range (the difference between the maximum and the minimum): the 0th percentile is the minimum and the 100th percentile is the maximum.

The pipeline applies the above-customized transformers to data sequentially. When the data goes through the above pipeline, the mean normalization will be performed!
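
If you want to verify that the pipeline really performs mean normalization, a quick NumPy check (assuming data is a NumPy array) looks like this:

import numpy as np

# Manual mean normalization: (x - mean) / (max - min), computed feature-wise
manual_scaled = (data - data.mean(axis=0)) / (data.max(axis=0) - data.min(axis=0))

print(np.allclose(manual_scaled, scaled_data))  # True, up to floating-point error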


5. Maximum Absolute Scaling

Maximum absolute scaling is performed by dividing every data value by the maximum absolute value of the feature. Its formula is very simple compared to the previous methods.

Maximum absolute scaling formula: x_scaled = x / max(|x|) (Image by author)

The previous methods center the data by subtracting either the mean, the minimum, or the median from each data value. Maximum absolute scaling does not center the data in that way. Therefore, this method works well with sparse data, in which most of the values are zero.

In Scikit-learn, maximum absolute scaling can be performed using the MaxAbsScaler() transformer.

from sklearn.preprocessing import MaxAbsScaler

sc = MaxAbsScaler()
sc.fit(data)
scaled_data = sc.transform(data)

When calling the fit() method, the maximum absolute value of each variable is found. If we need to apply the scaling process to data, we should also call the transform() method.
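
Because it never centers the data, MaxAbsScaler() also accepts sparse input directly. Here is a small illustrative sketch using a SciPy sparse matrix with made-up values:

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A tiny sparse matrix in which most of the values are zero (hypothetical values)
sparse_data = csr_matrix([[0.0, 2.0, 0.0],
                          [0.0, 0.0, -4.0],
                          [1.0, 0.0, 0.0]])

scaled_sparse = MaxAbsScaler().fit_transform(sparse_data)
print(scaled_sparse.toarray())  # every column now lies within [-1, 1]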


6. Vector Unit-Length Scaling

A unit vector has a magnitude of 1. To convert a non-zero vector to a unit vector, you divide that vector by its length (norm), which can be calculated using either the Manhattan distance (L1 norm) or the Euclidean distance (L2 norm) of that vector.

Now, consider the following non-zero vector.

Vector (Image by author)

The length (magnitude) of this vector can be calculated by using the Manhattan distance (L1 norm).

Manhattan distance (L1 norm) of the vector (Image by author)

The length (magnitude) of the vector can also be calculated by using the Euclidean distance (L2 norm) which is the most commonly used method.

Euclidean distance (L2 norm) of the vector (Image by author)

To convert x to a unit vector, we need to divide it by the length which is a real number as shown above. This is called normalizing the vector. After normalizing, the vector becomes a unit vector of magnitude (length) 1 and has the same direction as x.

Unit vector formula (Image by author)

The previous feature scaling methods performed calculations feature-wise, i.e., by considering each feature across all observations. However, in unit length scaling, the calculations are performed observation-wise, i.e., by considering each observation across all features.

To illustrate this, consider the following tabular data.

(Image by author)

There are three observation vectors in the data. For example, the first observation can be represented by the following vector,

Ob1 = (2, 3, 5)

Length of Ob1 = Sqrt(2² + 3² + 5²) = Sqrt(38) --> Euclidean distance (L2 norm)

Unit Vector = [2/Sqrt(38), 3/Sqrt(38), 5/Sqrt(38)]

Length of Unit Vector = 1

All other observations can be represented in a similar format.

The unit length scaling is applied to observation vectors in this way.
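
As a quick check of the Ob1 calculation above, the same result can be reproduced with NumPy:

import numpy as np

ob1 = np.array([2.0, 3.0, 5.0])
unit_ob1 = ob1 / np.linalg.norm(ob1)   # divide by the L2 norm, Sqrt(38)

print(unit_ob1)                  # [2/Sqrt(38), 3/Sqrt(38), 5/Sqrt(38)]
print(np.linalg.norm(unit_ob1))  # 1.0, as expected for a unit vector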

In Scikit-learn, unit length scaling is performed using the Normalizer() transformer.

from sklearn.preprocessing import Normalizer

sc = Normalizer(norm='l2')
sc.fit(data)
scaled_data = sc.transform(data)

When calling the fit() method, nothing is learned this time; Normalizer() is stateless, and fit() only validates the input (in the previous cases, the parameters were learned from the data). Calling the transform() method divides each observation vector by its length, i.e., performs unit-length scaling.

The Normalizer() transformer also provides an option to choose the distance type, Manhattan ('l1' norm) or Euclidean ('l2' norm). By default, 'l2' is used.

Feature scaling and distribution of data

None of the feature scaling methods discussed above changes the shape of the underlying distribution of the data. Before and after applying feature scaling, the variable's distribution keeps the same shape; only the range of values changes!

To illustrate this, I will create two histograms of the same feature before and after applying z-score standardization.

# Load iris data
from sklearn.datasets import load_iris

iris_data = load_iris().data
p = iris_data[:, 0] # Select the first feature

# Apply feature standardization
from sklearn.preprocessing import StandardScaler
p_scaled = StandardScaler().fit_transform(iris_data)[:, 0]
import matplotlib.pyplot as plt
plt.style.use('ggplot')

fig = plt.figure(figsize=(5.5, 4))
plt.hist(p, bins=20)
plt.title("Before scaling")
plt.show()
Histogram of unscaled variable (Image by author)
fig = plt.figure(figsize=(5.5, 4))
plt.hist(p_scaled, bins=20, color='green')
plt.title("After scaling")
plt.show()
Histogram of scaled variable (Image by author)

The variable's distribution keeps its shape, but the range of values changes!

import numpy as np

np.min(p), np.max(p) # Before scaling
The range before scaling (Image by author)
np.min(p_scaled), np.max(p_scaled) # After scaling
The range after scaling (Image by author)

Data leakage when feature scaling

In data preprocessing, data leakage happens when information from the test set leaks into the training process.

When training a model, the data used for training should not be used for testing. That's why we split a dataset into two parts: train and test sets. The train and test sets should be independent and should not be mixed up.

In Feature Scaling, data leakage easily happens in two ways.

  • When performing feature scaling before splitting data into train and test sets
from sklearn.datasets import load_iris
X = load_iris().data
y = load_iris().target

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, 
                                                    test_size=0.20,
                                                    random_state=42)

When calling the fit() method of the scaler (sc), the parameters (the mean and standard deviation of each variable) are learned from the entire dataset. Because splitting is done after scaling, information from what later becomes the test set leaks into the parameters used to scale the training data.

To avoid this, you should do feature scaling after splitting data.

from sklearn.datasets import load_iris
X = load_iris().data
y = load_iris().target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20,
                                                    random_state=42)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)

  • When calling the fit() method of the scaler (sc) twice on both training and test sets
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.fit_transform(X_test)  # wrong: the scaler is re-fitted on the test set

When calling the fit() method twice on both the training and test sets, the parameters are learned twice. The parameters should only be learned on the training set, not on the test set. The parameters learned on the train set are then used to transform the test set as well. In other words, you need to call the fit() method only once, on the train set. In this way, we can avoid leaking information from the test set into the training process.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# Or

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)
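
A convenient way to avoid both forms of leakage, especially when you also use cross-validation, is to bundle the scaler and the model into a Pipeline. The following is a minimal sketch; KNeighborsClassifier is just an example model, and X and y are the iris arrays from the snippets above.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Within each CV split, the scaler is fitted on the training folds only
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())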

Summary of feature scaling methods

  1. Standardization: StandardScaler(), Usage – When the data follows a Gaussian (normal) distribution or the distribution is unknown | Less sensitive to outliers than min-max scaling, but still affected by them
  2. Min-Max Scaling (Normalization): MinMaxScaler(), Usage – When the distribution of data is not known or does not follow the normal distribution | Highly sensitive to outliers in the data
  3. Robust Scaling: RobustScaler(), Usage – Robust to outliers in data
  4. Mean Normalization: StandardScaler() and RobustScaler(), Usage – Sensitive to outliers in data
  5. Maximum Absolute Scaling: MaxAbsScaler(), Usage – Does not center data unlike other methods | Works well with sparse data
  6. Vector Unit-Length Scaling: Normalizer(), Usage – Performs calculations observation-wise, i.e., by considering each observation across all features (all other methods perform calculations feature-wise, i.e., by considering each feature across all observations)

This is the end of today's article.

Please let me know if you've any questions or feedback.

Thank you so much for your continuous support! See you in the next article. Happy learning to everyone!

References

Special credits go to the author of the following book which I read to get some knowledge on the feature scaling methods.

  • Python Feature Engineering Cookbook (2nd Edition 2022) by Soledad Galli

Iris dataset info


This post was originally published on Substack by me.

Designed and written by: Rukshan Pramoditha

2023–08–15

Tags: Data Preprocessing Data Science Feature Scaling Machine Learning Scikit Learn
