PCA/LDA/ICA: a comparison of component analysis algorithms
Before diving in and comparing the algorithms, let's review each of them independently.
Note that this article does not aim to explain each algorithm in depth, but rather to compare their goals and results.
If you want to know more about the difference between PCA and ZCA, check out my previous post based on numpy:
PCA: Principal Component Analysis
- PCA is an unsupervised linear dimensionality reduction technique that aims to find a new set of orthogonal variables that captures the most important sources of variability in the data.
- It is widely used for feature extraction and data compression, and can be used for exploratory data analysis or as a preprocessing step for machine learning algorithms.
- The resulting components are ranked by the amount of variance they explain, and can be used to visualize and interpret the data, as well as for clustering or classification tasks (see the short sketch after this list).
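As a quick illustration of that ranking, here is a minimal sketch using scikit-learn on the iris data (which we will reuse throughout the article):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Fit PCA on the standardized iris features
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=4).fit(X)
# The ratios are sorted in decreasing order: the first component explains the most variance
print(pca.explained_variance_ratio_)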
LDA: Linear Discriminant Analysis
- LDA is a supervised linear dimensionality reduction technique that aims to find a new set of variables that maximizes the separation between classes while minimizing the variation within each class.
- It is widely used for feature extraction and classification, and can be used to reduce the dimensionality of the data while preserving the discriminative information between classes.
- The resulting components are ranked by their discriminative power, and can be used to visualize and interpret the data, as well as for downstream classification tasks (see the short sketch after this list).
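As a minimal sketch of the supervised fit (again on iris, just to show that the class labels are required and that the components come out ranked):
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X, y = load_iris(return_X_y=True)
# LDA is supervised: it needs the class labels y, and allows at most
# n_classes - 1 components (2 here, since iris has 3 classes)
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
# Each axis's share of the explained variance, sorted in decreasing order
print(lda.explained_variance_ratio_)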
ICA: Independent Component Analysis
- ICA is an unsupervised linear dimensionality reduction technique that aims to find a new set of variables that are statistically independent and non-Gaussian.
- It is widely used for signal processing and source separation, and can be used to extract underlying sources of variability in the data that are not accessible through other techniques.
- Unlike PCA, the resulting components have no natural ordering; they can still be used to visualize and interpret the data, as well as for clustering or classification tasks (see the short sketch after this list).
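As a minimal sketch with scikit-learn's FastICA (the two sources and the mixing matrix below are made up purely to illustrate the API; the article revisits this idea with a fuller example later):
import numpy as np
from sklearn.decomposition import FastICA
# Two made-up independent sources (a sine wave and heavy-tailed noise), mixed linearly
rng = np.random.default_rng(0)
S = np.c_[np.sin(np.linspace(0, 8, 500)), rng.laplace(size=500)]
A = np.array([[0.6, 0.4], [0.3, 0.7]])  # mixing matrix
X = S @ A.T  # observed mixtures
# FastICA tries to recover the independent sources (up to order, sign and scale)
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)
print(ica.mixing_.shape)  # estimated mixing matrix, shape (2, 2)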
Results on the iris dataset
Let's compare their results on the famous iris dataset using sklearn. First, let's plot the iris dataset using a pairplot of the 4 numerical features, with the species as the color-coded categorical feature:
Python">import seaborn as sns
import matplotlib.pyplot as plt
from Sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
data = iris.data
target = iris.target
target_names = iris.target_names
# Convert the iris dataset into a pandas DataFrame
iris_df = sns.load_dataset('iris')
iris_df['target'] = target
# Generate the pairplot
sns.pairplot(data=iris_df, hue='target', palette=['navy', 'turquoise', 'darkorange'], markers=['o', 's', 'D'],
plot_kws=dict(s=25, alpha=0.8, edgecolor='none'), diag_kws=dict(alpha=0.8, edgecolor='none'))
# Set the title and adjust plot spacing
plt.suptitle('Iris Pairplot')
plt.subplots_adjust(top=0.92)
plt.show()

We can now compute each transformation and plot the results. Notice that we use only 2 components, since LDA allows at most N-1 components, where N is the number of classes (here equal to 3, since there are 3 species of iris).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, FastICA
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# Standardize the data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
# Apply LDA with 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_std, y)
# Apply PCA with 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
# Apply ICA with 2 components (random_state fixed for reproducibility)
ica = FastICA(n_components=2, random_state=0)
X_ica = ica.fit_transform(X_std)
# Plot the results
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
for target, color in zip(range(len(target_names)), ['navy', 'turquoise', 'darkorange']):
    plt.scatter(X_lda[y == target, 0], X_lda[y == target, 1], color=color, alpha=.8, lw=2,
                label=target_names[target])
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.subplot(1, 3, 2)
for target, color in zip(range(len(target_names)), ['navy', 'turquoise', 'darkorange']):
    plt.scatter(X_pca[y == target, 0], X_pca[y == target, 1], color=color, alpha=.8, lw=2,
                label=target_names[target])
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.subplot(1, 3, 3)
for target, color in zip(range(len(target_names)), ['navy', 'turquoise', 'darkorange']):
    plt.scatter(X_ica[y == target, 0], X_ica[y == target, 1], color=color, alpha=.8, lw=2,
                label=target_names[target])
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('ICA')
plt.xlabel('IC1')
plt.ylabel('IC2')
plt.show()
This code loads the Iris dataset, applies LDA, PCA, and ICA with 2 components each, and then plots the results using different colors for each class.
Notice that it is generally good practice to standardize the data before applying PCA, ICA, or LDA. Standardization is important because these techniques are sensitive to the scale of the input features. Standardizing the data ensures that each feature has a mean of zero and a standard deviation of one, which puts all the features on the same scale and prevents any single feature from dominating the others, as the short sketch below illustrates.
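As a quick illustration of why scaling matters (a sketch, not part of the pipeline above; the exact numbers depend on the dataset):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = load_iris().data
# PCA on the raw features: the feature with the largest spread (petal length)
# dominates, and the first component alone captures most of the variance
print(PCA(n_components=1).fit(X).explained_variance_ratio_)
# PCA on standardized features: the variance is shared more evenly across features
print(PCA(n_components=1).fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)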
Since LDA is a supervised dimensionality reduction technique, it takes class labels as input. In contrast, PCA and ICA are unsupervised techniques, meaning that they only use the input data and do not take class labels into account.
The results of LDA can be interpreted as a projection of the data onto a space that maximizes class separation, whereas the results of PCA and ICA can be interpreted as a projection of the data onto a space that captures the most important sources of variability or independence, respectively.

Notice that ICA still shows separation between the categories, although that is not its purpose: that's because the classes are already quite well separated in the input dataset.
Let's put LDA aside and focus on the differences between PCA and ICA: LDA is a supervised technique that focuses on separating categories and caps the number of components, while PCA and ICA can produce a transformed matrix with the same shape as the input matrix.
Let's see the outputs with 4 components, both for PCA and ICA:
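Here is a possible way to compute these 4-component projections (a sketch reusing X_std from the code above; fixing FastICA's random_state is an added assumption for reproducibility):
from sklearn.decomposition import PCA, FastICA
# Keep all 4 components this time: the transformed matrices have the
# same shape as the input (150 samples x 4 components)
pca4 = PCA(n_components=4)
X_pca4 = pca4.fit_transform(X_std)
ica4 = FastICA(n_components=4, random_state=0)
X_ica4 = ica4.fit_transform(X_std)
print(X_pca4.shape, X_ica4.shape)  # (150, 4) (150, 4)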


Let's also compare the correlation matrix of each transformed dataset: notice that both methods result in uncorrelated components (in other words, the transformed features are orthogonal). For PCA this is a constraint of the algorithm (each new vector must be orthogonal to the previous ones), and for ICA it is a consequence of the model, which assumes the observed data is a mixture of independent source signals to be recovered, and independence implies zero correlation.
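One way to check this (a sketch assuming X_pca4 and X_ica4 from the previous snippet):
import numpy as np
# Correlation matrices of the transformed features: the off-diagonal terms
# should be numerically close to zero for both PCA and ICA
print(np.round(np.corrcoef(X_pca4, rowvar=False), 3))
print(np.round(np.corrcoef(X_ica4, rowvar=False), 3))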


So PCA and ICA seem to give results with similar properties, for two reasons:
- uncorrelatedness is "encoded" in both algorithms (as orthogonality for PCA, as a consequence of independence for ICA)
- the iris dataset exhibits well-separated classes
That's why we need another example, one better suited to ICA.
Another example
Let's see another example: we first generate a synthetic dataset with two independent sources, a sine wave and a square wave, which are mixed together as linear combinations to create the observed signals.
The actual, true, independent signals are the following:

They are mixed together as 2 linear combinations:

Let's see how PCA and ICA perform on this new dataset:

Notice how PCA creates a new component that exhibits a lot of variance, as a linear combination of the inputs, but that does not match the original sources at all: recovering sources is simply not the purpose of PCA.
By contrast, ICA recovers the original sources very well, regardless of how the variance is distributed between them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FastICA  # PCA is needed below as well
# Generate a synthetic dataset with two independent sources
np.random.seed(0)
n_samples = 2000
time = np.linspace(0, 8, n_samples)
s1 = np.sin(2 * time) # Source 1: sine wave
s2 = np.sign(np.sin(3 * time)) # Source 2: square wave
S = np.c_[s1, s2]
S += 0.2 * np.random.normal(size=S.shape) # Add noise to the sources
S /= S.std(axis=0) # Standardize the sources
# Mix the sources together to create a mixed signal
A = np.array([[0.5, 0.5], [0.2, 0.8]]) # Mixing matrix
X = np.dot(S, A.T) # Mixed signal
# Standardize the data
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
# Use PCA to reduce the dimensionality of the data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Use ICA to separate the sources from the mixed signal
ica = FastICA(n_components=2, random_state=0)  # fixed seed for reproducibility
X_ica = ica.fit_transform(X) # Estimated sources
# Plot the results
plt.figure()
models = [X, S, X_pca, X_ica]
names = ['Observations (mixed signal)',
         'True Sources',
         'PCA features',
         'ICA estimated sources']
colors = ['red', 'steelblue']
for ii, (model, name) in enumerate(zip(models, names), 1):
    plt.subplot(4, 1, ii)
    plt.title(name)
    for sig, color in zip(model.T, colors):
        plt.plot(sig, color=color)
plt.tight_layout()
plt.show()
Conclusion
The PCA, LDA, and ICA algorithms might seem like variations of one another, but they really do not have the same purpose. To sum up:
- PCA aims to create new components that capture the maximum variance of the input data
- LDA aims to create new components that separate clusters based on a categorical feature
- ICA aims to retrieve original source signals that have been mixed together as linear combinations in the input dataset
Hopefully, you now understand the differences between these algorithms better and will be able to quickly identify the one you need in the future.
If you liked this story, make sure to follow me and help me reach my 100 subscribers goal