Probabilistic View of Principal Component Analysis

One of the most widely used dimensionality reduction techniques in data science and machine learning is Principal Component Analysis (PCA). We have already discussed a few examples of applying PCA in a pipeline with a Support Vector Machine, and here we will look at PCA from a probabilistic perspective, which provides a more robust and comprehensive understanding of the underlying data structure. One of the biggest advantages of Probabilistic PCA (PPCA) is that it can handle missing values in a dataset, which is not possible with classical PCA. Since we will discuss Latent Variable Models and the Expectation-Maximization algorithm, you can also check the detailed post on those topics.
What can you expect to learn from this post?
- Short Intro to PCA.
- Mathematical building blocks for PPCA.
- Expectation-Maximization (EM) algorithm or Variational Inference: which one to use for parameter estimation?
- Implementing PPCA with TensorFlow Probability for a toy dataset.
Let's dive into this!
1. Singular Value Decomposition (SVD) and PCA:
One of the most important concepts in linear algebra is the SVD. It is a factorization technique for real or complex matrices, whereby a matrix (say A) can be factorized as:
A = U Σ Vᵀ        (1)
where U and V are orthogonal matrices (their transposes equal their inverses) and Σ is a (rectangular) diagonal matrix. A need not be square; say it is an N×D matrix, so we can already think of it as our data matrix with N instances and D features. U and V are then square matrices of size N×N and D×D respectively, and Σ is an N×D matrix whose upper D×D block is diagonal, with the remaining entries equal to zero.
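To make the shapes concrete, here is a small NumPy sketch; the matrix and its dimensions are arbitrary illustrations, not data from this post:

```python
import numpy as np

rng = np.random.default_rng(42)
N, D = 6, 3                      # N instances, D features
A = rng.normal(size=(N, D))      # our "data matrix"

# Full SVD: A = U @ Sigma @ Vt, with U (N x N) and Vt (D x D) orthogonal
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Embed the singular values into the rectangular N x D Sigma
Sigma = np.zeros((N, D))
Sigma[:D, :D] = np.diag(s)

print(U.shape, Sigma.shape, Vt.shape)    # (6, 6) (6, 3) (3, 3)
print(np.allclose(A, U @ Sigma @ Vt))    # True: the factorization recovers A
```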
We also know the eigenvalue decomposition: a square, diagonalizable matrix B can be factorized as:
B = Q Λ Q⁻¹
where Q is the square N×N matrix whose i-th column is the eigenvector qᵢ of B, and Λ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues.
Let's modify equation (1) by multiplying both sides on the left by Aᵀ:
AᵀA = (U Σ Vᵀ)ᵀ (U Σ Vᵀ) = V Σᵀ Uᵀ U Σ Vᵀ = V (ΣᵀΣ) Vᵀ
Here AᵀA is a square (D×D) matrix even though A itself need not be square. ΣᵀΣ is a diagonal matrix and V is an orthogonal matrix, so this is exactly the eigendecomposition of AᵀA. The eigenvalues here are the squares of the singular values of A in eq. (1).
For a positive semi-definite matrix, the SVD and the eigendecomposition are equivalent. PCA boils down to the eigendecomposition of the covariance matrix: finding the largest eigenvalue(s) and the corresponding eigenvector(s) amounts to finding the direction(s) of maximum variance.
Given D-dimensional data (features), a full eigendecomposition costs ∼O(D³); if we instead settle for a latent space of dimension M (< D), the computation drops to ∼O(MD²).
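As a quick sanity check of the relationship between the eigendecomposition of AᵀA and the singular values of A, here is a short sketch with arbitrary illustrative data; it also reads off the leading principal direction:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 3
A = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated features
A -= A.mean(axis=0)                                      # center the data

# Eigendecomposition of A^T A (proportional to the covariance matrix)
eigvals, eigvecs = np.linalg.eigh(A.T @ A)

# Singular values of A
s = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of A^T A equal the squared singular values of A
print(np.allclose(np.sort(eigvals)[::-1], s**2))   # True

# The eigenvector with the largest eigenvalue is the first principal direction
print(eigvecs[:, np.argmax(eigvals)])
```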
2. Building Blocks for PPCA:
2.1. Assumptions:
PPCA is a Latent Variable Model (LVM), and we have discussed LVMs in detail before, including the Expectation-Maximization (EM) algorithm. LVMs offer a low-dimensional representation of the data. Let's say our data x is N×D, with D features; the LVM for PCA then seeks an M-dimensional latent vector of unobserved variables z that can be used to generate the observed variables x, the two being related via a linear relationship:
x = W z + μ + ϵ        (2.1)
The noise ϵ in the equation above is a D-dimensional Gaussian vector with zero mean and covariance σ²I:
p(ϵ) = N(ϵ | 0, σ²I)        (2.2)
Since our latent space is M-dimensional, the matrix W must be D×M. The latent variable z is assumed to have a zero-mean, unit-covariance Gaussian distribution:
p(z) = N(z | 0, I)        (2.3)
The above two equations lead to the conditional distribution of x given z:
p(x | z) = N(x | Wz + μ, σ²I)        (2.4)
which is another normal distribution with mean Wz + μ (we can set μ = 0 for centered data) and covariance σ²I.
The above equation should remind us of a fundamental property of the normal distribution: if x follows a multivariate normal distribution, x ∼ N(μ, Σ), then any linear transformation of x is also multivariate normal, y = Ax + b ∼ N(Aμ + b, AΣAᵀ).
Given the joint distribution of x and z, the marginal distribution of x is also Gaussian:
p(x) = N(x | μ, C),  where C = W Wᵀ + σ²I        (2.5)
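The generative story above is easy to simulate. Here is a minimal NumPy sketch (with illustrative dimensions) that draws samples from the model and checks that the sample covariance of x approaches C = WWᵀ + σ²I:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 50_000, 4, 2
sigma = 0.5
W = rng.normal(size=(D, M))                 # loading matrix
mu = np.zeros(D)                            # we set mu = 0

z = rng.normal(size=(N, M))                 # z ~ N(0, I),            eq. (2.3)
eps = sigma * rng.normal(size=(N, D))       # eps ~ N(0, sigma^2 I),  eq. (2.2)
x = z @ W.T + mu + eps                      # x = W z + mu + eps,     eq. (2.1)

C = W @ W.T + sigma**2 * np.eye(D)          # marginal covariance,    eq. (2.5)
print(np.round(np.cov(x, rowvar=False), 2))
print(np.round(C, 2))                       # the two should be close
```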
As we would like to determine the parameters W, μ, and σ², we can approach this problem via maximum likelihood estimation (MLE) or the EM algorithm. Here we will focus on the EM approach and then on Variational Inference; both are well described in Bishop's book. Bishop argues that as the data dimensionality increases, the iterative EM steps can gain a computational advantage over closed-form MLE. This is mainly due to the cost of the covariance matrix: evaluating the D-dimensional data covariance matrix takes O(ND²), with N the number of data points.
2.2. EM Steps for PPCA:
Earlier we discussed the EM algorithm in the context of Gaussian Mixture Models; here I will describe it in the context of PPCA. The steps of the EM algorithm are as follows:
- In the expectation step, we calculate the expectation of the complete-data log-likelihood w.r.t. the posterior distribution of the latent variables, p(z|x), evaluated using the 'old' parameters.
- In the maximization step, maximizing this expected log-likelihood yields 'new' parameters, which are then plugged back into step 1.
As the data points are independent, the complete-data likelihood is:
p(X, Z | μ, W, σ²) = ∏ₙ p(xₙ | zₙ) p(zₙ),  n = 1, …, N        (2.6)
The main task of the E-step is to calculate the expectation of the logarithm of the expression above. Here we have to use p(x|z) and p(z) from equations (2.4) and (2.3) respectively in section 2. The derivation is given in Bishop's book, but the important part is that it requires computing E[zₙ] and E[zₙzₙᵀ], both of which can be obtained from the posterior distribution p(z|x).
Once the E-step is done, the M-step involves maximizing this expected log-likelihood w.r.t. the parameters W and σ².
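For concreteness, here is a minimal NumPy sketch of these EM updates for PPCA, following the closed-form E- and M-step expressions in Bishop's book (Ch. 12); it assumes the data matrix X is already centered (μ = 0), and the function and variable names are my own:

```python
import numpy as np

def ppca_em(X, M, n_iter=100, seed=0):
    """EM for PPCA on centered data X (N x D), with latent dimension M.
    A minimal sketch of the updates described above (Bishop, Ch. 12)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = rng.normal(size=(D, M))      # initial loading matrix
    sigma2 = 1.0                     # initial noise variance

    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))   # (M x M)
        Ez = X @ W @ Minv                                     # rows are E[z_n]
        Ezz = N * sigma2 * Minv + Ez.T @ Ez                   # sum_n E[z_n z_n^T]

        # M-step: closed-form updates for W and sigma^2
        W_new = (X.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(X**2)
                  - 2.0 * np.sum(Ez * (X @ W_new))
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new

    return W, sigma2
```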
2.3. Variational Inference, EM & ELBO:
The EM steps described above rest on a crucial assumption: that the posterior distribution p(z|x) is tractable (necessary for the E-step on eq. 2.6). What if that's not the case? What if there is no analytic expression for the posterior? This is the starting point of Variational Inference.
We now turn to variational methods. The main idea is to find a distribution q(z) that is as close as possible to the posterior p(z|x). This approximating distribution has its own variational parameters, q(z|θ), and we look for the setting of θ that makes q close to the posterior of interest; q(z) should also be relatively simple and tractable for inference. To measure the closeness of q(z) and p(z|x), a common choice is the Kullback-Leibler (KL) divergence, which for Variational Inference naturally introduces the Evidence Lower Bound (ELBO):
KL(q(z) ‖ p(z|x)) = log p(x) − ELBO(q)        (2.7)
where ELBO (q) is defined as:
ELBO(q) = E_q[log p(x, z)] − E_q[log q(z)]        (2.8)
For the derivation, you can check my notebook in the reference or any other available lecture notes.
Since the KL divergence is non-negative, log p(x) ≥ ELBO(q). So what we do in Variational Inference (VI) is maximize the ELBO.
We can also easily see the connection between VI and the traditional EM algorithm: when q(z) = p(z|x), the KL divergence term vanishes and the ELBO equals log p(x).
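To make this identity concrete, here is a small numerical sketch with a one-dimensional conjugate-Gaussian toy model (the model, numbers, and names are illustrative assumptions, not taken from the notebook): it checks that log p(x) = ELBO(q) + KL(q ‖ p(z|x)) for an arbitrary Gaussian q.

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D model: z ~ N(0, 1), x | z ~ N(z, sigma_n^2); observe a single x.
sigma_n, x_obs = 3.0, 2.5
log_px = norm.logpdf(x_obs, 0.0, np.sqrt(1.0 + sigma_n**2))   # p(x) = N(0, 1 + sigma_n^2)

# Exact posterior p(z | x) is Gaussian (conjugacy)
post_var = sigma_n**2 / (1.0 + sigma_n**2)
post_mean = x_obs / (1.0 + sigma_n**2)

# Pick some Gaussian q(z), not necessarily equal to the posterior
q_mean, q_std = 0.5, 0.8
rng = np.random.default_rng(0)
z = rng.normal(q_mean, q_std, size=200_000)                   # samples from q

# Monte-Carlo ELBO = E_q[log p(x, z)] - E_q[log q(z)]
log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x_obs, z, sigma_n)
elbo = np.mean(log_joint - norm.logpdf(z, q_mean, q_std))

# Analytic KL(q || posterior) between two univariate Gaussians
kl = (np.log(np.sqrt(post_var) / q_std)
      + (q_std**2 + (q_mean - post_mean)**2) / (2.0 * post_var) - 0.5)

print(f"log p(x)  = {log_px:.4f}")
print(f"ELBO + KL = {elbo + kl:.4f}   (matches up to Monte-Carlo error)")
```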
Having completed the overview of PPCA, we will now implement it using TensorFlow Probability: we will use the definition of the ELBO and try to maximize it, which in turn maximizes a lower bound on the data log-likelihood log p(x).
3. Implementing PPCA Using TensorFlow Probability:
To implement a simple example of PPCA via Variational Inference, I'll follow the original Probabilistic PCA example from TensorFlow Probability [4].
For this implementation we will assume that we know the standard deviation of the noise (σ; I chose 3.0) as defined in equation (2.2), place a prior over W, and estimate the posterior via Variational Inference. Let's define the joint distribution:
Here we've used JointDistributionCoroutineAutoBatched, which is an "auto-batched" version of JointDistributionCoroutine. It automatically applies batch semantics to the joint distribution based on the shapes of the input arguments, which allows for more flexible handling of batch dimensions: we can pass batched or unbatched arguments to the joint distribution, and it will handle the batch semantics automatically. A minimal sketch of such a model is given below.
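The sketch follows the structure of the TensorFlow Probability Probabilistic PCA example [4]. The dimensions (data_dim, latent_dim, num_datapoints), the prior scale on W, and the name tf_model are illustrative assumptions; only the noise standard deviation (3.0) comes from the text above.

```python
import functools
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Illustrative sizes; the accompanying notebook may use different values.
data_dim, latent_dim, num_datapoints = 2, 1, 500
stddv_datapoints = 3.0   # known noise standard deviation sigma, eq. (2.2)

def probabilistic_pca(data_dim, latent_dim, num_datapoints, stddv_datapoints):
    # Prior over the loading matrix W (D x M)
    w = yield tfd.Normal(loc=tf.zeros([data_dim, latent_dim]),
                         scale=2.0 * tf.ones([data_dim, latent_dim]),
                         name="w")
    # Standard-normal latent variables Z (M x N), eq. (2.3)
    z = yield tfd.Normal(loc=tf.zeros([latent_dim, num_datapoints]),
                         scale=tf.ones([latent_dim, num_datapoints]),
                         name="z")
    # Observations X | W, Z ~ N(WZ, sigma^2 I), eq. (2.4) with mu = 0
    x = yield tfd.Normal(loc=tf.matmul(w, z),
                         scale=stddv_datapoints,
                         name="x")

tf_model = tfd.JointDistributionCoroutineAutoBatched(
    functools.partial(probabilistic_pca,
                      data_dim=data_dim,
                      latent_dim=latent_dim,
                      num_datapoints=num_datapoints,
                      stddv_datapoints=stddv_datapoints))

actual_w, actual_z, x_train = tf_model.sample()
```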
After we sample from this joint distribution using tf_model.sample() (line 51 in the accompanying notebook), we plot the observed data (x) distribution in 2D:
[Figure: 2-D scatter plot of the sampled observations x]
On a side note: since these data points are randomly sampled when you run the code, you may not get exactly the same points.
We assume that the posterior p(W, Z|X) can be approximated by a simpler distribution q(W, Z) parameterized by θ. In VI our aim is to minimize the KL divergence between q(W, Z) and p(W, Z|X), which by equations (2.7) and (2.8) is equivalent to maximizing the Evidence Lower Bound (ELBO).
The ELBO in this case:
ELBO(q) = E_q[log p(X, W, Z)] − E_q[log q(W, Z)]
To maximize the ELBO (equivalently, to minimize its negative), we first define the surrogate distribution, similar to the way we defined the joint distribution, and then use the tfp.vi machinery to fit the surrogate posterior to the target (unnormalized) log density, as sketched below.
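Continuing the sketch above (and assuming tf_model, x_train, and the dimension variables defined there), the variational fit might look like the following; the surrogate parameter names and optimizer settings are illustrative, and tfp.vi.fit_surrogate_posterior is one standard way to run this fit:

```python
tfb = tfp.bijectors

# Trainable variational parameters for a mean-field surrogate q(W, Z)
qw_mean = tf.Variable(tf.random.normal([data_dim, latent_dim]))
qz_mean = tf.Variable(tf.random.normal([latent_dim, num_datapoints]))
qw_stddv = tfp.util.TransformedVariable(
    1e-4 * tf.ones([data_dim, latent_dim]), bijector=tfb.Softplus())
qz_stddv = tfp.util.TransformedVariable(
    1e-4 * tf.ones([latent_dim, num_datapoints]), bijector=tfb.Softplus())

def factored_surrogate():
    yield tfd.Normal(loc=qw_mean, scale=qw_stddv, name="qw")
    yield tfd.Normal(loc=qz_mean, scale=qz_stddv, name="qz")

surrogate_posterior = tfd.JointDistributionCoroutineAutoBatched(factored_surrogate)

# Unnormalized target: the joint log density with the observed data x_train pinned
target_log_prob_fn = lambda w, z: tf_model.log_prob((w, z, x_train))

# Maximize the ELBO (i.e., minimize the negative ELBO); returns the loss trace
losses = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn,
    surrogate_posterior=surrogate_posterior,
    optimizer=tf.optimizers.Adam(learning_rate=0.05),
    num_steps=200)
```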
After fitting the surrogate distribution, we can use it to sample new data points, resulting in the plot below:
[Figure: 2-D scatter plot of data points generated from the fitted surrogate posterior]
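The sampling step might look like the following continuation of the earlier sketches (the plotting code itself lives in the notebook [5]):

```python
# Draw (W, Z) from the fitted surrogate posterior ...
w_sample, z_sample = surrogate_posterior.sample()

# ... and generate new observations x conditioned on them via the joint model
_, _, x_generated = tf_model.sample(value=(w_sample, z_sample, None))
```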
Conclusions:
We have gone through the mathematics behind PPCA and used TensorFlow Probability to test our understanding on a simple toy dataset. The ability to draw from the surrogate posterior gives us the power to impute missing values and generate new samples, which is impossible with standard PCA. Many real-world datasets exhibit intricate relationships, noise corruption, and incomplete observations, rendering classical PCA less effective. By accounting for uncertainty, handling missing data, and offering a probabilistic modeling framework, PPCA opens up new avenues for data analysis, pattern recognition, and machine learning. Next time you plan to use PCA for your dataset but know that the observations could be noisy or incomplete, why not try PPCA?
Probabilistic PCA is also closely related to Factor Analysis (FA), which is likewise a linear-Gaussian latent variable model. The main difference lies in the noise distribution of Eq. (2.2): the covariance is assumed to be isotropic (σ²I) in PPCA, whereas in FA it is a general diagonal matrix. I'll leave some references below for further exploration inspired by this post.
References:
[1] ‘Probabilistic Principal Component Analysis'; M. Tipping, C. Bishop; J.R. Statist. Soc. B (1999).
[2] ‘Pattern Recognition and Machine Learning'; C. Bishop; Chapter 12: Continuous Latent Variables.
[3] ‘Continuous Latent Variable Models'; University of Toronto; Lecture Notes.
[4] ‘Probabilistic PCA'; TensorFlow Probability Example.
[5] Link to my notebook; GitHub.