Sampling from Multivariate Distributions: From Statistical to Generative Modeling

Background
Sampling synthetic data from multivariate distributions is essential for understanding interdependencies, facilitating statistical inference, and quantifying uncertainty in data analysis. It is widely used in finance, engineering, medicine, environmental science, and social science. The process involves fitting a mathematical model to the structure of the data and then generating new samples from the fitted distribution. Modeling joint multivariate distributions has a long history in statistics. In simple cases, data can be modeled with predefined statistical distributions that have explicit mathematical descriptions, such as multivariate Gaussian distributions and Copula functions, two classic statistical approaches. However, as data grow in dimensionality and in the complexity of their dependencies, traditional methods fall short. Meanwhile, modern generative AI techniques such as generative adversarial networks (GANs) and diffusion models have shown their potential.

This article reviews the fundamental ideas of the following methods:
- Kernel Density Estimation
- Copula
- Variational Autoencoder
- Generative Adversarial Networks
- Diffusion Model
We also explore their applications through experiments on both simple and complex synthetic data, providing a glimpse into both traditional and cutting-edge multivariate sampling techniques – no insane math involved!
Methods for Multivariate Sampling
1. Kernel Density Estimation (KDE)
Introduced in the 1950s [1], KDE is a classic statistical method for estimating probability density functions (PDFs). Applying KDE is similar to smoothing out bumps in a rug: a kernel function represents the density contributed by each data point, and these point-wise densities are added up across the k-dimensional space to create a smooth global density, as Figure 2 shows. Areas with high density values correspond to high sample probability.

The joint probability density function can be written as:

$\hat{f}_H(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i)$

where $K$ is the kernel function, parameterized by the bandwidth (covariance) matrix $H$ and centered at each data point $x_i$. The kernel determines how the point-wise probability mass is smoothed. Since KDE assumes no specific parametric form for any variable, it can capture complex relationships. Figure 3 illustrates frequently used kernel shapes in a 1-d view.

After deriving the joint probability density, new samples can be drawn from a binned space in which each bin is assigned a probability of containing data points.
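As a rough sketch (not the exact configuration used later in this article), SciPy's gaussian_kde can fit a multivariate Gaussian KDE and draw new samples; note that its resample method perturbs randomly chosen data points with kernel noise rather than binning the space. The data, dimensions, and sample sizes below are made up for illustration:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative data: 1,000 points from an arbitrary 3-d distribution
rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0, 0, 0],
                               cov=[[1.0, 0.5, 0.0],
                                    [0.5, 1.0, 0.3],
                                    [0.0, 0.3, 1.0]],
                               size=1000)

# gaussian_kde expects shape (n_dims, n_samples)
kde = gaussian_kde(data.T)           # bandwidth matrix H chosen by Scott's rule by default
density = kde(data.T)                # evaluate the fitted joint PDF at the data points
new_samples = kde.resample(500).T    # draw 500 new points from the fitted density
print(new_samples.shape)             # (500, 3)
```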
2. Copula
Copula is another classic statistical method for modeling multivariate dependence, which can be traced back to 1959 [2]. The idea behind a Copula is to model the dependence between the marginal cumulative distribution functions (CDFs) rather than the probability density functions (PDFs):

$U_i = F_i(X_i), \qquad F(x_1, \dots, x_d) = C\big(F_1(x_1), \dots, F_d(x_d)\big) = C(u_1, \dots, u_d)$

In the equations above, $F_i$ is the marginal cumulative distribution function of $X_i$, and $U_i$ is the uniformly distributed random variable obtained by transforming $X_i$ through $F_i$. The Copula function $C$ is the joint cumulative distribution function of all the $U_i$, so the dependence between variables is embedded entirely in the Copula function. Figure 4 depicts the dependence forms of four commonly used Copula families in a bivariate case:

Take the bivariate Gaussian Copula as an example. The dependence structure is captured by the covariance between the variables of a bivariate normal distribution:

$C(u_1, u_2) = \Phi_\Sigma\big(\Phi^{-1}(u_1), \Phi^{-1}(u_2)\big)$

where $\Phi^{-1}$ is the inverse CDF of the standard normal distribution and $\Phi_\Sigma$ is the joint CDF of a bivariate normal distribution with covariance matrix $\Sigma$.
Because the Copula approach rests entirely on classic statistical theory, new samples can be generated from the fitted joint distribution just as easily as from other well-known statistical distributions.
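For intuition, here is a minimal sketch of sampling from a bivariate Gaussian Copula with NumPy and SciPy. The correlation value and the Gamma/Beta marginals are arbitrary choices for illustration, not the ones used in the experiment later:

```python
import numpy as np
from scipy.stats import norm, gamma, beta

# Assumed inputs for illustration: a 2x2 correlation matrix and two marginals
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])
n = 1000
rng = np.random.default_rng(0)

# 1. Sample from a bivariate standard normal with the desired correlation
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=corr, size=n)

# 2. Map each coordinate to a uniform via the standard normal CDF
u = norm.cdf(z)                        # u[:, i] ~ Uniform(0, 1), dependence preserved

# 3. Push the uniforms through the inverse CDFs of the chosen marginals
x0 = gamma.ppf(u[:, 0], a=2.0)             # e.g. a Gamma marginal
x1 = beta.ppf(u[:, 1], a=2.0, b=5.0)       # e.g. a Beta marginal
samples = np.column_stack([x0, x1])
```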
3. Variational Autoencoder (VAE)
Starting from this section, we will step into the area of generative AI. The first method is the VAE, an unsupervised artificial neural network architecture that was first introduced in 2013 [3]. VAE consists of two components: an encoder that maps the original data point to a probabilistic latent space, and a decoder that reconstructs it from the latent space. Both components are user-defined artificial neural networks. The probabilistic latent space z is typically represented by a multivariate Gaussian distribution, which means the input data points are encoded into mean and variance vectors of the distribution.

To train a VAE, the goal is to reconstruct each data point as accurately as possible while avoiding cheating by simply memorizing data points. Therefore, the loss function is composed of a reconstruction error and a regularization term, the latter of which minimizes the discrepancy between the inferred latent distribution and the prior latent distribution:

$\mathcal{L} = -\,\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] + D_{KL}\big(q(z|x)\,\|\,p(z)\big)$
Given the trained VAE, the data generation is the process of decoding random Gaussian noise, which is the mapping from the latent space back to the input space.
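A minimal PyTorch sketch of this idea is below. The layer sizes, latent dimension, and squared-error reconstruction term are illustrative assumptions, not the exact architecture used in the experiment:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE for low-dimensional tabular data (sizes are illustrative)."""
    def __init__(self, x_dim=3, h_dim=64, z_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.to_mu = nn.Linear(h_dim, z_dim)        # mean vector of q(z|x)
        self.to_logvar = nn.Linear(h_dim, z_dim)    # log-variance vector of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()   # reconstruction error (squared error here)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()  # KL(q(z|x) || N(0, I))
    return recon + kl

# Generation: decode random Gaussian noise drawn from the latent prior
# (the model below is untrained; the point is only to show the mechanics).
model = VAE()
with torch.no_grad():
    z = torch.randn(500, 2)           # 500 draws from the latent prior N(0, I)
    new_samples = model.decoder(z)    # map the latent space back to the input space
```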

4. Generative Adversarial Networks (GAN)
GAN is another well-known generative AI technique, dating back to 2014 [4]. It is an unsupervised training framework rather than a specific network architecture, and it has found successful applications in image generation, art blending, photo inpainting, and more. A GAN includes a generator that produces fake samples and a discriminator that tries to distinguish real samples from fake ones. Both the generator and the discriminator can adopt flexible network architectures for different application tasks.

GAN aims to train a strong generator and a strong discriminator simultaneously in a zero-sum game. The loss function, derived from the classic cross-entropy scheme for binary classification, therefore takes an adversarial form in which the discriminator tries to maximize it while the generator tries to minimize it:

$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$
After training, the sampling process uses only the generator, which transforms random noise into a sample that follows the distribution of the original training data.
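The sketch below shows the adversarial training step and the sampling step in PyTorch. Network sizes, optimizer settings, and the noise dimension are illustrative assumptions, and the generator loss uses the common non-saturating variant:

```python
import torch
import torch.nn as nn

# Illustrative generator / discriminator for 3-d tabular data
z_dim, x_dim = 8, 3
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

def train_step(real):
    batch = real.shape[0]
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0
    fake = G(torch.randn(batch, z_dim)).detach()
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to fool the discriminator (non-saturating form)
    fake = G(torch.randn(batch, z_dim))
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# After training, sampling needs only the generator:
with torch.no_grad():
    new_samples = G(torch.randn(500, z_dim))
```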

5. Diffusion Model
Diffusion models are the backbone of several of the latest popular image generation technologies, such as DALL-E and Stable Diffusion. Since their introduction in 2015 [5], they have become a strong alternative to GANs, which are notoriously difficult to train. Inspired by thermodynamics, a diffusion model gradually destroys the structure of a data sample, as in a thermodynamic diffusion process, and subsequently reconstructs the sample through a reverse process. Specifically, the forward diffusion process iteratively adds random Gaussian noise to the sample for T steps, and the reverse process uses an artificial neural network to predict the noise to be removed from the noised sample at any step.

In the improved version of diffusion models, i.e., Denoising Diffusion Probabilistic Models (DDPM) [6], training is simplified because the noised sample at any step can be obtained directly from the original sample in a single closed-form step. The training loss is the discrepancy between the predicted noise and the actual injected noise. For sampling, we start from pure Gaussian noise and predict the noise to remove, reversing the process step by step to generate a new sample.
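A compact PyTorch sketch of these two pieces follows. The number of steps T, the linear noise schedule, the network sizes, and the crude step embedding are all illustrative assumptions rather than the settings used in the experiment:

```python
import torch
import torch.nn as nn

# Illustrative DDPM pieces for 3-d tabular data
T = 200
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # cumulative product, i.e. alpha-bar at step t

# A small network that predicts the injected noise from (noised sample, step t)
eps_model = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, 3))

def training_loss(x0):
    """Diffuse to a random step t in one closed-form step, then predict the noise."""
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,))
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise      # q(x_t | x_0) sampled directly
    t_in = (t.float() / T).unsqueeze(1)                  # crude step embedding
    pred = eps_model(torch.cat([x_t, t_in], dim=1))
    return ((pred - noise) ** 2).mean()                  # MSE between predicted and actual noise

@torch.no_grad()
def sample(n):
    """Start from Gaussian noise and reverse the diffusion step by step."""
    x = torch.randn(n, 3)
    for t in reversed(range(T)):
        t_in = torch.full((n, 1), t / T)
        eps = eps_model(torch.cat([x, t_in], dim=1))
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject noise except at the last step
    return x
```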

Data Experiment
This part of the article presents a data experiment to provide a quick understanding of the abilities of the five methods introduced above. Therefore, we focus on the results rather than the details of implementations and model tuning. The experiment uses these models to fit the distributions of the synthetic datasets and then generate new samples for comparison with the original ones.
Synthetic Data
Two synthetic datasets are designed. The first one contains samples from a 3-d joint Gaussian distribution, a "standard" multivariate distribution with simple marginal distributions and dependence relationships:

The second one contains samples from custom highly nonlinear equations, which represent a "non-standard" multivariate distribution with complex marginal distributions and dependence relationships:

where Beta() is the Beta distribution and N() is the normal distribution.
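The exact parameters and equations of the two datasets are not reproduced here, but for readers who want to experiment, a hypothetical NumPy sketch in the same spirit might look like this. The mean, covariance, and nonlinear transforms below are stand-ins, not the actual definitions used in the experiment:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# "Standard" dataset: a 3-d joint Gaussian (mean and covariance are stand-ins)
mean = np.zeros(3)
cov = np.array([[1.0, 0.6, 0.2],
                [0.6, 1.0, 0.4],
                [0.2, 0.4, 1.0]])
standard = rng.multivariate_normal(mean, cov, size=n)

# "Non-standard" dataset: highly nonlinear dependencies built from Beta and normal noise
x0 = rng.beta(2.0, 5.0, size=n)                       # Beta-distributed driver variable
x1 = np.sin(6 * np.pi * x0) + rng.normal(0, 0.1, n)   # nonlinear function of x0 plus noise
x2 = x0 ** 2 + 0.5 * x1 + rng.normal(0, 0.1, n)       # mixes both variables
non_standard = np.column_stack([x0, x1, x2])
```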
Evaluation
The synthetic datasets are split into a training set and a testing set. The paired scatter plots below show the marginal and bivariate distributions for original and generated samples in the testing set.
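One way to produce such comparisons (assuming pandas and seaborn, and the hypothetical column names x0, x1, x2) is an overlaid pair plot:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def compare_pairplot(original, generated, columns=("x0", "x1", "x2")):
    """Overlay pair plots of original test samples and model-generated samples."""
    df_orig = pd.DataFrame(original, columns=list(columns)).assign(source="original")
    df_gen = pd.DataFrame(generated, columns=list(columns)).assign(source="generated")
    df = pd.concat([df_orig, df_gen], ignore_index=True)
    sns.pairplot(df, hue="source", corner=True, plot_kws={"alpha": 0.4})
    plt.show()

# Example usage: compare_pairplot(test_samples, generated_samples)
```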
- KDE
Here we use Gaussian KDE models, which seem a little underfitted: the marginal PDFs of the generated samples are flatter than those of the original samples. In addition, the highly nonlinear dependence between x0 and x1 in the non-standard distribution is not well captured.

- Copula
Gaussian Copula models are applied in the experiment. The Copula models perform well except for the dependence between x0 and x1 in the non-standard distribution.

- VAE
The VAEs here use 2-hidden-layer fully connected neural networks for the encoder and decoder. They generally model the data well.

- GAN
For the GAN, the generator and discriminator use 3-hidden-layer fully connected neural networks. The GAN performs similarly to the VAE.

- Diffusion Model
The diffusion models use a 3-hidden-layer fully connected neural network for the reverse (denoising) process. Although they capture all the dependence relationships, the diffusion models produce samples with less variety than the original samples, indicating that further model tuning may be needed.

Summary
Sampling from arbitrary multivariate distributions is an essential task in many fields, especially for making inferences with limited data. The increasing complexity of data necessitates a variety of candidate methods, ranging from statistical models to generative AI models. This article provides a brief introduction to the ideas of five popular methods. To dive deeper into this topic together, please follow me on Medium to get updates on future articles about:
- Metrics for goodness-of-fit evaluation.
- A Python library to automate the use of all methods above.
I look forward to your comments and feedback!
Reference
[1]. Rosenblatt, M. (1956). Remarks on Some Nonparametric Estimates of a Density Function. The Annals of Mathematical Statistics, 832–837.
[2]. Sklar, M. (1959). Fonctions de répartition à n dimensions et leurs marges. In Annales de l'ISUP (Vol. 8, No. 3, pp. 229–231).
[3]. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[4]. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). Generative Adversarial nets. Advances in neural information processing systems, 27.
[5]. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015, June). Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning (pp. 2256–2265). PMLR.
[6]. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840–6851.