Stable Diffusion: Mastering the Art of Interior Design

Author: Murphy

In the fast-paced world we live in, and especially after the pandemic, many of us have realised that having a pleasant environment like home to escape to is priceless and a goal worth pursuing.

Whether you are looking for a Scandinavian, minimalist, or glamorous style to decorate your home, it is not easy to imagine how every single object will fit into a space full of different pieces and colours. For that reason, we usually seek professional help to create those amazing 3D images that help us understand what our future home will look like.

However, these 3D images are expensive, and if our initial idea does not look as good as we thought, getting new images takes more time and money, both of which are scarce nowadays.

In this article, I explore the Stable Diffusion model, starting with a brief explanation of what it is, how it is trained, and what is needed to adapt it for inpainting. Finally, I finish the article by applying it to a 3D image of my future home, where I change the kitchen island and cabinets to a different colour and material.

Figure 1: Interior Design (source)

As always, the code is available on Github.

Stable Diffusion

What is it?

Stable Diffusion [1] is a generative AI model released in 2022 by the CompVis Group that produces photorealistic images from text and image prompts. It was primarily designed to generate images guided by text descriptions, but it can also be used for other tasks such as inpainting or video creation.

Its success comes from the Perceptual Image Compression step, which converts a high-dimensional image into a smaller latent space. This compression enables the model to run on machines with limited resources, making it accessible to everyone, something that was not possible with the previous state-of-the-art models.

Figure 2: Stable Diffusion architecture (source)
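To get a feel for what this means in practice, here is a minimal text-to-image sketch using the Hugging Face diffusers library. The checkpoint name and prompt are illustrative examples, not the ones used later in this article, and half precision is used so the model fits on a single consumer GPU.

```python
# Minimal text-to-image sketch with Hugging Face diffusers.
# The checkpoint and prompt below are illustrative examples.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # example checkpoint
    torch_dtype=torch.float16,         # half precision keeps memory usage low
)
pipe = pipe.to("cuda")                 # runs on a single consumer GPU

prompt = "a bright Scandinavian living room, photorealistic"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("living_room.png")
```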

How does it learn?

Stable Diffusion is a Latent Diffusion Model (LDM) with three main components (a variational autoencoder (VAE) [2], a U-Net [3], and an optional text encoder) that learns how to denoise images conditioned by a prompt (text or another image) in order to create a new image.
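These three components can be inspected directly. The sketch below, assuming the example checkpoint CompVis/stable-diffusion-v1-4 from diffusers, simply loads them to show how they fit together.

```python
# Loading the three main components of a Latent Diffusion Model separately.
# "CompVis/stable-diffusion-v1-4" is used here as an example checkpoint.
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

repo = "CompVis/stable-diffusion-v1-4"

vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")                     # VAE: image <-> latent space
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")            # U-Net: noise prediction
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")   # optional text encoder (CLIP)
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")

print(vae.config.latent_channels)        # 4   -> the latent space Z is 64x64x4
print(unet.config.cross_attention_dim)   # 768 -> matches the CLIP text embeddings
```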

The training process of Stable Diffusion has 5 main steps:

  1. The Perceptual Image Compression step consists of an Encoder that receives an image with a dimension of 512x512x3 and encodes it into a smaller latent space Z with a dimension of 64x64x4. To better preserve the details of an image (for example, the eyes in a human face), the latent space Z is regularized using a low-weighted Kullback-Leibler term to make it zero-centered and to obtain a small variance.
Figure 3: Perceptual Image Compression process where the Encoder converts a 512x512x3 image to a latent space of 64x64x4 (image made by the author).
  2. The Diffusion Process is responsible for progressively adding Gaussian noise to the latent space Z until all that remains is random noise, generating a new latent space Zt, where t is the number of diffusion steps needed to reach a fully noisy latent space. This step is important because Stable Diffusion has to learn how to go from noise back to the original image, as we will see in the next steps.
Figure 4: Diffusion Process where Gaussian noise is added gradually to the latent space (image made by the author)
  3. The Denoising Process trains a U-Net architecture to estimate the amount of noise in the latent space Zt in order to subtract it and restore Z. This process recovers the original latent space Z by gradually denoising Zt, essentially the inverse of the Diffusion Process.
Figure 5: Denoising Process where U-Net predicts the noise in a latent space and removes it until it completely restores the original latent space (image made by the author)
  4. During the Denoising Process, a prompt, usually text and/or another image, can be concatenated to the latent space Zt. This concatenation conditions the Denoising Process, which allows the creation of new images. The authors added cross-attention mechanisms to the backbone of the U-Net to handle these prompts, since they are effective for learning attention-based models over various input types. When it comes to text, the model uses the pre-trained text encoder CLIP [4], which encodes the prompt into a 768-dimensional vector that is then concatenated to Zt and received by the U-Net as input.

As we can see in Figure 6, we concatenated the text prompt "remove the lamp" to Zt, which conditioned the Denoising Process to restore a latent space without the lamp near the chair that the original image had.

Figure 6: Condition the denoising process with a text prompt to remove the lamp in the original image (image made by the author)
  5. Finally, the Decoder receives the denoised latent space Z as input and learns how to estimate the component-wise variance used when encoding the image into the smaller latent space. After estimating the variance, the Decoder can generate a new image with the same dimensions as the original one. (A code sketch after Figure 7 walks through these five steps end to end.)
Figure 7: Decoder restores the original image without the lamp and with the original size of 512x512x3 (image made by the author)
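The sketch below walks through the five steps above at inference time using the diffusers building blocks. A random tensor stands in for a real image and only a single denoising step is shown, purely to make the shapes and the flow concrete; the checkpoint name is again an illustrative assumption.

```python
# End-to-end sketch of the five steps: encode, add noise, predict the noise
# with text conditioning, remove it, and decode. Illustrative only.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "CompVis/stable-diffusion-v1-4"   # example checkpoint

vae = AutoencoderKL.from_pretrained(repo, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet").to(device)
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder").to(device)
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

# Step 1: encode a 512x512x3 image into the 64x64x4 latent space Z
image = torch.randn(1, 3, 512, 512, device=device)   # placeholder for a real image in [-1, 1]
with torch.no_grad():
    z = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
print(z.shape)                                        # torch.Size([1, 4, 64, 64])

# Step 2: diffusion process - add Gaussian noise to Z at timestep t
t = 500                                               # an intermediate timestep
noise = torch.randn_like(z)
z_t = scheduler.add_noise(z, noise, torch.tensor([t], device=device))

# Step 4 prerequisite: encode the text prompt with CLIP into 768-dim embeddings
tokens = tokenizer("remove the lamp", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids.to(device)).last_hidden_state

# Steps 3-4: the U-Net predicts the noise in z_t conditioned on the prompt,
# and the scheduler removes it (one step of the iterative denoising loop)
with torch.no_grad():
    noise_pred = unet(z_t, t, encoder_hidden_states=text_emb).sample
    z_prev = scheduler.step(noise_pred, t, z_t).prev_sample

# Step 5: the Decoder maps the denoised latent back to a 512x512x3 image
with torch.no_grad():
    reconstruction = vae.decode(z_prev / vae.config.scaling_factor).sample
print(reconstruction.shape)                           # torch.Size([1, 3, 512, 512])
```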

Inpainting Variant of Stable Diffusion

Inpainting is the task of filling masked regions of an image with new content, either to restore corrupted parts of the image or to replace undesired content.

Stable Diffusion can be trained to generate new images based on an image, a text prompt, and a mask. This type of model is already available on HuggingFace.
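As a sketch of how such a model can be used off the shelf, the snippet below relies on the StableDiffusionInpaintPipeline from diffusers. The checkpoint, file paths, and prompt are illustrative assumptions, not the ones used for the kitchen example in this article.

```python
# Minimal inpainting sketch with diffusers. Checkpoint, file paths, and
# prompt are illustrative assumptions.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",   # example inpainting checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("kitchen.png").convert("RGB").resize((512, 512))
mask_image = Image.open("island_mask.png").convert("L").resize((512, 512))  # white = area to repaint

result = pipe(
    prompt="kitchen island with a dark green base and a white marble countertop",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
result.save("kitchen_inpainted.png")
```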

Tags: AI Computer Vision Interior Design Machine Learning Stable Diffusion
