Boosting image generation by intersecting GANs with Diffusion models
A recipe for stable and efficient image-to-image translation

Visual Foundation Models (VFMs) are at the core of cutting-edge technologies such as Visual ChatGPT¹. In this article, we briefly discuss recent advances that blend two important ingredients of the VFM soup, GANs and Diffusion models, ending up with ATME at their intersection. ATME is a novel model that I introduced in the paper Look ATME: The Discriminator Mean Entropy Needs Attention², with the GitHub repository available here.
We will first go through the relevant weaknesses and strengths of each type of generative modeling. Then we discuss two categories of solutions for merging them: the naive GAN ∪ Diffusion class and, in more depth, the efficient GAN ∩ Diffusion class of models. By the end, you will have a picture of how research around some of these VFMs is currently evolving.
Generative models
First, a little bit of background. The aim of (conditional) generative models is to learn how to generate data y from a target domain using information x from a source domain. Both domains can be images, text, semantic maps, audio, etc. Two types of modeling have become very successful: Generative Adversarial Networks (GANs) and Diffusion Probabilistic Models. Concretely,
- GANs learn how to sample from the data distribution p(y∣x) by training a generator model that produces data distributed according to g(y∣x). A discriminator model guides the generator from blind to accurate data generation by minimizing a divergence (or distance) between the distributions g and p.
- Diffusion models learn how to sample from p(y∣x) by marginalizing the latent variables y₁, y₂, ⋯, yₙ out of p(y∣x, y₁, y₂, ⋯, yₙ). These variables are a sequence of increasingly noisy versions of y (or of an encoding of y), and the marginalization is carried out by learning a denoising model.
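To make the contrast concrete, here is a minimal sketch of the two training objectives in PyTorch: the adversarial game played between a generator and a discriminator, and the standard noise-prediction loss of Diffusion models. The `generator`, `discriminator` and `denoiser` modules, as well as the `alphas_cumprod` noise schedule, are placeholders of my own, not code from any of the papers discussed here.

```python
import torch
import torch.nn.functional as F

# --- GAN: adversarial game between generator g and discriminator d ---
def gan_losses(generator, discriminator, x, y):
    """x: source images, y: real target images (hypothetical modules)."""
    y_fake = generator(x)                                # sample from g(y|x)
    d_real = discriminator(x, y)                         # logits for real pairs
    d_fake = discriminator(x, y_fake.detach())           # logits for fake pairs
    # Discriminator: maximize log D(real) + log(1 - D(fake))
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    # Generator: non-saturating loss, minimize -log D(fake)
    loss_g = F.binary_cross_entropy_with_logits(discriminator(x, y_fake),
                                                torch.ones_like(d_fake))
    return loss_d, loss_g

# --- Diffusion: learn to denoise increasingly noisy latents y_1, ..., y_n ---
def diffusion_loss(denoiser, x, y, alphas_cumprod):
    """Standard noise-prediction (epsilon) objective at one random step t."""
    t = torch.randint(0, len(alphas_cumprod), (y.shape[0],), device=y.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)              # noise schedule at t
    eps = torch.randn_like(y)
    y_t = a.sqrt() * y + (1 - a).sqrt() * eps            # noisy latent y_t
    return F.mse_loss(denoiser(y_t, t, x), eps)          # predict the noise
```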
If you need more details about these types of modeling, there are countless sources available online. For GANs, you may want to start from this article and, for Diffusion models, from this one.

Now that we have covered the basics, let's discuss some applications. Figure 1 shows the official Visual ChatGPT demo. It uses several models for vision-language interactions, some of which are listed in the table below.
Most of these are generative models, with the majority based on Stable Diffusion. This reflects a recent shift of interest from GANs to Diffusion models, triggered by evidence³ that the latter are superior to the former at image synthesis. One takeaway from this article is that this does not imply Diffusion models are better than GANs for every image generation task: a combination of the two tends to perform better than either part on its own.
Before discussing this and arriving at ATME, let's pave the way by revisiting the main weaknesses and strengths of GANs and Diffusion models.
GANs
The main premise introduced in the original GAN paper⁴, and emphasized in the tutorial, is that, in the limit of a large enough model and infinite data, the minimax game played by the generator and discriminator converges to the Nash equilibrium, where the (vanilla) GAN objective attains the value −log 4. In practice, however, this is hardly ever observed. The departures from this theoretical result give rise to what is popularly known as the training instability of GANs. This instability, together with mode collapse, is their main drawback. These weaknesses are offset by the high image quality GANs still achieve, in one shot, with lightweight models.
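As a quick sanity check of that −log 4 value (a worked example of mine, not taken from the paper): at equilibrium the discriminator cannot tell real from fake and outputs 1/2 everywhere, so the vanilla objective E[log D(y)] + E[log(1 − D(ŷ))] evaluates to log(1/2) + log(1/2) = −log 4 ≈ −1.386.

```python
import math
import torch

# Vanilla GAN value: V(D, G) = E_real[log D(y)] + E_fake[log(1 - D(yhat))].
# At the Nash equilibrium D outputs 1/2 everywhere, so V = 2 * log(1/2) = -log 4.
d_real = torch.full((1000,), 0.5)   # D(y) on real samples at equilibrium
d_fake = torch.full((1000,), 0.5)   # D(G(z)) on generated samples at equilibrium

value = torch.log(d_real).mean() + torch.log(1 - d_fake).mean()
print(value.item(), -math.log(4))   # both ~ -1.3863
```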
Diffusion Models
In contrast, Diffusion models are stable but notoriously inefficient, owing to the large number of steps required to learn the denoising distribution. This is because that distribution is commonly assumed to be Gaussian, an assumption only justified in the limit of infinitesimally small denoising steps.
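The price of this assumption is paid at sampling time: generating one image means looping over all the denoising steps, with one forward pass of the denoiser per step. Here is a minimal ancestral-sampling sketch under the standard DDPM update rule; the `denoiser` predicting the injected noise is again a placeholder of mine.

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, x_cond, shape, betas):
    """Ancestral sampling: one denoiser call per step, for len(betas) steps."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    y = torch.randn(shape)                          # start from pure noise y_n
    for t in reversed(range(len(betas))):           # n steps -> n forward passes
        eps = denoiser(y, torch.tensor([t]), x_cond)
        coef = betas[t] / (1 - alphas_cumprod[t]).sqrt()
        y = (y - coef * eps) / alphas[t].sqrt()     # Gaussian posterior mean
        if t > 0:
            y = y + betas[t].sqrt() * torch.randn_like(y)   # re-inject noise
    return y
```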
Alternatives have recently been developed that reduce the number of denoising steps (even down to 2) by using multi-modal denoising distributions. This requires combining Diffusion models with GANs, as we discuss in the following.
GAN ∪ Diffusion
Current approaches to training GANs with Diffusion are very promising. They belong to the GAN ∪ Diffusion class of models, which use generative adversarial training together with multi-step diffusion processes.
In order to improve training stability and mode coverage in GANs, these models inject instance noise by following diffusion processes ranging from thousands of steps (as in Diffusion-GAN⁵) down to as few as two steps (as in Denoising Diffusion GANs⁶). They outperform strong GAN baselines on various datasets, but still need multiple denoising steps; a rough sketch of this noise-injection idea is given below.
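The sketch below is my own simplification of the instance-noise idea, not the authors' code: both real and generated samples are pushed through the forward diffusion q(yₜ∣y) at a randomly drawn step t before the discriminator scores them.

```python
import torch
import torch.nn.functional as F

def diffuse(y, t, alphas_cumprod):
    """Forward diffusion q(y_t | y): inject Gaussian instance noise at step t."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * y + (1 - a).sqrt() * torch.randn_like(y)

def diffusion_gan_d_loss(discriminator, generator, x, y, alphas_cumprod):
    """Discriminator sees diffused real and fake samples at a random step t."""
    t = torch.randint(0, len(alphas_cumprod), (y.shape[0],), device=y.device)
    y_fake = generator(x).detach()
    d_real = discriminator(diffuse(y, t, alphas_cumprod), t)
    d_fake = discriminator(diffuse(y_fake, t, alphas_cumprod), t)
    return F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
           F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
```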
So, is it possible to generate images with a GAN in one shot and still leverage denoising diffusion processes?
The answer is yes, and this defines the GAN ∩ Diffusion class of models.
GAN ∩ Diffusion
It turns out that a single trick can make the pix2pix⁷ visual foundation GAN model stable by design: paying attention to the discriminator mean entropy.
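Concretely, for a PatchGAN-style decision map D with per-pixel probabilities, the mean entropy is just the binary entropy averaged over the map. A minimal way to monitor it (a sketch; the function name and tensor shapes are mine):

```python
import torch

def discriminator_mean_entropy(decision_map, eps=1e-8):
    """Mean binary entropy of a discriminator decision map D with values in (0, 1).

    Low entropy -> confident (near 0/1) decisions;
    maximal entropy (log 2) -> D ~ 0.5 everywhere, as at the Nash equilibrium.
    """
    d = decision_map.clamp(eps, 1 - eps)
    entropy = -(d * d.log() + (1 - d) * (1 - d).log())
    return entropy.mean()

# Example: a perfectly undecided discriminator has mean entropy log 2 ~ 0.693
print(discriminator_mean_entropy(torch.full((1, 1, 30, 30), 0.5)))
```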

The resulting model, ATME, is shown in Figure 2. Given a joint distribution p(x, y) of source and target images, the input image x at epoch t is corrupted with Wₜ = W(Dₜ₋₁), as follows:
xₜ = x (1 + Wₜ),
with W being a small deterministic network that transforms the discriminator decision map Dₜ₋₁ from the previous epoch into Wₜ.
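Here is a minimal sketch of this corruption step. It illustrates the equation above rather than reproducing the official ATME code, and the `WNet` module below is only a stand-in for the small deterministic net W.

```python
import torch
import torch.nn as nn

class WNet(nn.Module):
    """Stand-in for the small deterministic net W that maps the discriminator
    decision map D_{t-1} to a corruption map W_t matching the size of x."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.Tanh(),                       # keep the corruption bounded
        )

    def forward(self, d_prev, size):
        w = self.net(d_prev)                 # transform the decision map
        return nn.functional.interpolate(w, size=size, mode="bilinear",
                                         align_corners=False)

def corrupt_input(x, d_prev, w_net):
    """x_t = x * (1 + W_t), with W_t = W(D_{t-1})."""
    w_t = w_net(d_prev, size=x.shape[-2:])
    return x * (1 + w_t)

# Usage: d_prev plays the role of a (B, 1, 30, 30) PatchGAN decision map
# from epoch t-1 (shapes chosen for illustration only).
x = torch.randn(4, 3, 256, 256)
d_prev = torch.rand(4, 1, 30, 30)
x_t = corrupt_input(x, d_prev, WNet())
```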