The Arrival of SDXL 1.0

A cute little robot learning how to paint – created using SDXL 1.0

In the rapidly evolving world of machine learning, where new models and technologies flood our feeds almost daily, staying updated and making informed choices becomes a daunting task. Today, we direct our focus towards SDXL 1.0, a text-to-image generative model which has undoubtedly garnered considerable interest within the field.

SDXL 1.0, short for "Stable Diffusion XL," is touted as a latent text-to-image diffusion model that supposedly surpasses its predecessors with a range of promising enhancements. In the upcoming chapters, we will examine these claims and take a closer look at what has actually changed.

Most notably, SDXL is an open-source model that addresses a major concern in the domain of generative models. While black-box models have gained recognition as state-of-the-art, their architecture's opacity hinders a comprehensive assessment and validation of their performance, limiting broader community involvement.

In this article, we embark on a detailed exploration of this promising model, inspecting its capabilities and building blocks, and drawing comparisons with previous Stable Diffusion models. My aim is to provide a clear understanding without delving too deeply into technical complexities, making this an engaging and accessible read for all. Let's get started!

Understanding Stable Diffusion: Unraveling the Magic of Text-to-Image Generation

If you already feel confident about how Stable Diffusion works, or would rather skip the technical parts, feel free to jump past this chapter.

Stable Diffusion, the groundbreaking deep learning text-to-image model, sent shockwaves through the AI world upon its release in 2022, harnessing the power of cutting-edge diffusion techniques.

This significant development represents a notable advancement in AI image generation, potentially expanding access to high-performance models for a broader audience. The intriguing capability to transform plain text descriptions into intricate visual outputs has captured the attention of those who have experienced it. Stable Diffusion exhibits proficiency in producing high-quality images while also demonstrating noteworthy speed and efficiency, thereby increasing the accessibility of AI-generated art creation.

Stable Diffusion's training involved large public datasets like LAION-5B, leveraging a wide array of captioned images to refine its artistic abilities. However, a key aspect contributing to its progress lies in the active participation of the community, offering valuable feedback that drives the model's ongoing development and enhances its capabilities over time.

How does it work?

Let's start with the main building blocks of a Stable Diffusion model and how one can train and make predictions with each of them individually.

U-Net and the Essence of Diffusion Process:

To generate images using computer vision models, we venture beyond the conventional approach of relying on labeled data, as in classification, detection, or segmentation. In Stable Diffusion, the goal is to enable models to learn the intricate details of images themselves, capturing rich context with an approach known as "diffusion."

The diffusion process unfolds in two distinctive phases:

  • In the first part, we take an image and introduce a controlled amount of random noise. This step is referred to as forward diffusion.
  • In the second part, we aim to denoise the image and reconstruct the original content. This process is known as reverse diffusion.
Noise Addition by steps. Source: Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

The first part, which involves adding Gaussian noise to the input image at each time step t, is relatively straightforward. However, the second stage poses a challenge, as directly computing the original image is not feasible. To overcome this obstacle, we employ a neural network, and this is where the ingenious U-Net comes into play.

Leveraging U-Net, we train our model to predict noise from a given randomly noised image at time step t and calculate the loss between the predicted and actual noise. With a sufficiently large dataset and multiple noise steps, the model gains the ability to make educated predictions on noise patterns. This trained U-Net model also proves invaluable for generating approximate reconstructions of images from given noise.

Reverse Diffusion. Source: Denoising Diffusion Probabilistic Models
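To make this training step concrete, here is a minimal sketch using the diffusers library, assuming a small unconditional U-Net and a standard DDPM noise schedule; the random image batch is a stand-in for a real dataloader, and the hyperparameters are illustrative only.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

# A small unconditional U-Net and a scheduler defining the noise schedule.
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stand-in for a batch of training images scaled to [-1, 1].
clean_images = torch.randn(4, 3, 64, 64)

# Forward diffusion: pick a random timestep t per image and add the matching amount of noise.
noise = torch.randn_like(clean_images)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (clean_images.shape[0],))
noisy_images = scheduler.add_noise(clean_images, noise, timesteps)

# Training objective for reverse diffusion: the U-Net predicts the added noise,
# and the loss compares the prediction with the actual noise.
optimizer.zero_grad()
noise_pred = model(noisy_images, timesteps).sample
loss = F.mse_loss(noise_pred, noise)
loss.backward()
optimizer.step()
```

At inference time this loop runs in reverse: starting from pure noise, the model's noise prediction is repeatedly subtracted step by step until an image emerges.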

If you are familiar with basic probability and computer vision models, this process is relatively straightforward. However, there is one more issue worth noting: adding noise to millions of full-resolution images and reconstructing them would be extremely time-consuming and would quickly exhaust computing power. To address this challenge, researchers revisited a well-known architecture: autoencoders. Since we have already used similar components in the U-Net, such as transposed convolutions and residual blocks, some of these elements also play a vital role in autoencoders.

With autoencoders, one can "encode" the data into a much smaller "latent" space and "decode" it back into the original space. In fact, that is why the original Stable Diffusion paper is titled Latent Diffusion. This allows us to effectively compress large images into lower-dimensional representations.

Simple AutoEncoder representation. Image by author.

The forward and reverse diffusion operations will now occur within significantly smaller latent spaces, resulting in reduced memory requirements and substantially faster processing.
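As a rough illustration, the snippet below encodes an image into the latent space and decodes it back with a VAE from the diffusers library; the checkpoint id and the scaling-factor convention are assumptions based on the common SD 1.x setup, and the random tensor stands in for a preprocessed image.

```python
import torch
from diffusers import AutoencoderKL

# A VAE compatible with SD 1.x latents (repo id assumed to still be available on the Hub).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

image = torch.randn(1, 3, 512, 512)  # stand-in for an image scaled to [-1, 1]

with torch.no_grad():
    # Encode: a 512x512x3 image becomes a 64x64x4 latent, roughly a 48x reduction in size.
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    # Decode: map the latent back to pixel space.
    reconstruction = vae.decode(latents / vae.config.scaling_factor).sample

print(image.shape, "->", latents.shape)  # torch.Size([1, 3, 512, 512]) -> torch.Size([1, 4, 64, 64])
```

That roughly 48x compression is what makes running the diffusion process in latent space so much cheaper than doing it on raw pixels.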

We are nearly finished with the renowned "Stable Diffusion" architecture; the only remaining part is the conditioning. Typically, this is achieved using text encoders, although other methods that use images as conditioning, such as ControlNet, exist; they fall outside the scope of this article. Text conditioning plays a pivotal role in generating images based on text prompts, and this is where the true magic of the Stable Diffusion model lies.

To achieve this, we can train a text embedding model like BERT or CLIP using images with captions and add token embedding vectors as conditioning inputs. Employing a cross-attention mechanism (queries, keys and values), we can map the conditional text embeddings into U-Net residual blocks. Consequently, we can incorporate image captions alongside the images themselves during the training process and effectively condition image generations based on the provided text.

Latent Diffusion Model – Source: Paper
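For intuition, here is a bare-bones, single-head version of the cross-attention layer described above, where image latent tokens attend to text token embeddings; the dimensions (320 latent channels, 768-dimensional CLIP embeddings, 77 text tokens) are illustrative rather than taken from any specific checkpoint.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal cross-attention: image latent tokens attend to text token embeddings."""
    def __init__(self, latent_dim=320, text_dim=768, head_dim=64):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, head_dim, bias=False)  # queries from image features
        self.to_k = nn.Linear(text_dim, head_dim, bias=False)    # keys from text embeddings
        self.to_v = nn.Linear(text_dim, head_dim, bias=False)    # values from text embeddings
        self.to_out = nn.Linear(head_dim, latent_dim)

    def forward(self, latent_tokens, text_tokens):
        q = self.to_q(latent_tokens)   # (B, HW, d)
        k = self.to_k(text_tokens)     # (B, T, d)
        v = self.to_v(text_tokens)     # (B, T, d)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ v)   # text-conditioned image features, (B, HW, latent_dim)

# Illustrative shapes: a 64x64 latent grid flattened to 4096 tokens, 77 CLIP text tokens.
layer = CrossAttention()
out = layer(torch.randn(1, 4096, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 320])
```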

Now that you know the building blocks of a Stable Diffusion model, we can readily compare the previous Stable Diffusion models and make a more informed assessment of their strengths and limitations.

What's New in SDXL?

Now that we have grasped the fundamentals of SD models, let's delve into the SDXL paper to uncover the transformative changes introduced in this novel model. In summary, SDXL presents the following advancements:

  • Increased Number of U-Net Parameters: SDXL enhances its model capacity by incorporating a larger number of U-Net parameters, allowing for more sophisticated image generation.
  • Heterogeneous Distribution of Transformer Blocks: Departing from the uniform distribution of transformer blocks in previous models ([1, 1, 1, 1]), SDXL adopts a heterogeneous distribution ([0, 2, 10]), shifting the bulk of the transformer computation to the lower-resolution levels of the U-Net.
  • Enhanced Text Conditioning Encoder: SDXL leverages a bigger text conditioning encoder, OpenCLIP ViT-bigG, to effectively incorporate textual information into the image generation process.
  • Additional Text Encoder: The model employs a second text encoder, CLIP ViT-L, whose output is concatenated with that of the first, enriching the conditioning process with complementary textual features.
  • Introducing "Size-Conditioning": A novel conditioner called "Size-Conditioning" takes the original training image's width and height as conditional input, enabling the model to adapt its image generation based on size-related cues (see the sketch after this list).
  • "Crop-Conditioning" Parameter: SDXL introduces the "Crop-Conditioning" parameter, incorporating image cropping coordinates as conditional input.
  • "Multi-Aspect Conditioning" Parameter: By incorporating the bucket size for conditioning, the "Multi-Aspect Conditioning" parameter enables SDXL to cater to various aspect ratios.
  • Specialized Refiner Model: SDXL introduces a second SD model specialized in handling high-quality, high-resolution data; essentially, it is an img2img model that effectively captures intricate local details.
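As a practical illustration of the size- and crop-conditioning inputs, the sketch below passes them to the diffusers StableDiffusionXLPipeline; the parameter names follow the diffusers API, while the prompt and the concrete values are just examples.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Size- and crop-conditioning are regular call arguments in diffusers:
# original_size mimics the resolution the "training image" would have had,
# crops_coords_top_left mimics the crop offset, and target_size is the output resolution.
image = pipe(
    prompt="a cute little robot learning how to paint",
    original_size=(1024, 1024),
    crops_coords_top_left=(0, 0),
    target_size=(1024, 1024),
).images[0]
image.save("robot.png")
```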

Now, let's take a closer look at how some of these additions compare to previous Stable Diffusion models.

Stability.ai's Official Comparison:

Let's begin by examining stability.ai's official comparison, as presented by the authors. This comparison offers valuable insights into user preferences between SDXL and Stable Diffusion. However, we must approach the findings with a cautious eye…

Comparing user preferences between SDXL and previous models. Source: Paper

This study demonstrates that participants chose SDXL models over the previous SD 1.5 and 2.1 models. In particular, the SDXL model with the Refiner addition achieved a win rate of 48.44%. It is important to note that while this result is statistically significant, we must also take into account the inherent biases introduced by the human element and the inherent randomness of generative models.

Performance Against State-of-the-Art Black-Box Models:

Source: Paper

Currently, Midjourney is highly popular among users, with some considering it the current state-of-the-art solution. According to the official survey, SDXL exhibits a higher preference rate in categories such as "Food & Beverage" and "Animals." However, in other categories like "Illustrations" and "Abstract," users still prefer Midjourney V5.1.

Source: SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Once again, we observe a similar pattern in the case of complex prompts. The paper asserts that SDXL outputs are preferred in 7 out of 10 complex subjects. However, without knowledge of the specific prompts used, it is difficult to draw a firm conclusion. Additionally, the lack of information about the prompt encoder in Midjourney further complicates matters, and only time will reveal the true preferences.

Number of U-Net Parameters

As mentioned earlier, U-Net models play a vital role in Stable Diffusion, facilitating the reconstruction of images from a given noise sample. In SDXL, the authors have made a noteworthy improvement by incorporating a considerably larger U-Net than previous versions of SD: roughly 2.6B U-Net parameters, compared to about 860M in its predecessors.

Image by Author.
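If you want to verify these numbers yourself, a quick sketch along the following lines should work with the diffusers library; the repo ids are the public Hugging Face checkpoints at the time of writing and may need adjusting if they move.

```python
from diffusers import UNet2DConditionModel

def count_params(model):
    return sum(p.numel() for p in model.parameters())

unet_sdxl = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
print(f"SDXL U-Net: {count_params(unet_sdxl) / 1e9:.2f}B parameters")    # roughly 2.6B

unet_sd15 = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
print(f"SD 1.5 U-Net: {count_params(unet_sd15) / 1e6:.0f}M parameters")  # roughly 860M
```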

Although having more parameters may seem promising at first, it is essential to consider the tradeoff between complexity and quality. As the number of parameters increases, so do the system requirements for both training and generation. While it is still premature to draw definitive conclusions about output quality, the cost in complexity is already evident.

Number of Text Encoder Parameters

Indeed, when examining the total number of text encoder parameters, we observe a notable increase in SDXL 1.0 compared to its predecessors. The introduction of two text conditioners in SDXL, as opposed to a single one in previous versions, accounts for this significant growth in parameter count. This expansion empowers SDXL to leverage a larger volume of textual information.

SDXL combines the OpenCLIP bigG/14 text encoder (694.7 million parameters) with CLIP L/14 (123.65 million parameters), for a total of over 800 million text encoder parameters. This represents a substantial leap from its predecessors.

Image by Author.
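A similar sanity check can be done for the two text encoders, which diffusers exposes as text_encoder and text_encoder_2 on the SDXL pipeline; the exact counts may differ slightly from the rounded figures above.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# SDXL ships two text encoders; their outputs are concatenated for conditioning.
clip_vit_l = sum(p.numel() for p in pipe.text_encoder.parameters())       # CLIP ViT-L
openclip_bigg = sum(p.numel() for p in pipe.text_encoder_2.parameters())  # OpenCLIP ViT-bigG
print(f"CLIP ViT-L:    {clip_vit_l / 1e6:.1f}M")
print(f"OpenCLIP bigG: {openclip_bigg / 1e6:.1f}M")
print(f"Total:         {(clip_vit_l + openclip_bigg) / 1e6:.1f}M")
```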

Once more, employing larger and multiple text encoders may appear appealing at first, but it introduces added complexity that could prove detrimental. Consider a scenario where you are fine-tuning an SDXL model with your own data. In such cases, determining optimal settings becomes more challenging than with previous SD models, because the "sweet spot" of hyperparameters now has to be found for both encoders.

The Workflow

Indeed, the term "XL" in SDXL is indicative of its expanded scale and increased complexity compared to previous SD models. SDXL surpasses its predecessors in various aspects, boasting a larger number of parameters, including two text encoders and two U-Net models – the base model and the refiner, which essentially functions as an image-to-image model. Naturally, this heightened complexity shows up in the SDXL pipeline:

Usual SD 1.5 generation pipeline. Image by Author
SDXL generation pipeline. Image by Author
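In code, the two-stage workflow looks roughly like the diffusers sketch below: the base model produces latents from the prompt, and the refiner polishes them as an img2img pass. The checkpoint ids and the latent hand-off pattern follow the diffusers documentation; the prompt and file name are illustrative.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Stage 1: the base model turns the text prompt into a latent image.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Stage 2: the refiner is an img2img model that polishes high-frequency detail.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

prompt = "a cute little robot learning how to paint, studio lighting"

latent = base(prompt=prompt, output_type="latent").images
image = refiner(prompt=prompt, image=latent).images[0]
image.save("robot_refined.png")
```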

SDXL in Practice

The model weights of SDXL have been officially released and are freely accessible for use from Python, thanks to the diffusers library from Hugging Face. Additionally, there is a user-friendly GUI option available known as ComfyUI. This GUI provides a highly customizable, node-based interface, allowing users to intuitively place the building blocks of the Stable Diffusion model and visually connect them.

With these readily available implementations, users can seamlessly integrate SDXL into their projects, enabling them to harness the power of this cutting-edge latent text-to-image diffusion model.

An Example of ComfyUI workflow pipeline. Image by author.

Current State of SDXL and Personal Experiences

While the new features and additions in SDXL appear promising, some fine-tuned SD 1.5 models are still delivering better results. This outcome is primarily attributed to the great support from the thriving community – an advantage that stems from the open-source approach.

At this initial stage, SDXL already exhibits improvements over 1.5, and I am confident that with continued community support, its performance will only grow stronger in the future. However, it is essential to acknowledge that as models become more complex, using and fine-tuning them demands greater computational resources. But there's no need for concern yet…

LoRA (Low-Rank Adaptation) has gained popularity for fine-tuning large language models. This approach adds pairs of rank-decomposition weight matrices to existing weights and trains only these newly added weights. As a result, training becomes faster and computationally more efficient. The incorporation of LoRAs is expected to pave the way for the community to create even better custom versions in the near future. Notably, SDXL already fully supports LoRAs.
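Loading a community LoRA into an SDXL pipeline is already straightforward with diffusers; in the sketch below, the LoRA repo id is a placeholder rather than a real checkpoint.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# "some-user/some-sdxl-lora" is a hypothetical repo id; substitute any SDXL-compatible LoRA.
pipe.load_lora_weights("some-user/some-sdxl-lora")

image = pipe(prompt="a cute little robot learning how to paint").images[0]
```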

Despite the positive developments, it's worth noting that SDXL still grapples with some of the usual Stable Diffusion shortcomings, as officially acknowledged by the authors:

  • The model does not achieve perfect photorealism
  • The model cannot render legible text
  • The model struggles with more difficult tasks which involve compositionality, such as rendering an image corresponding to "A red cube on top of a blue sphere"
  • Faces and people in general may not be generated properly.
  • The autoencoding part of the model is lossy.

Personal Observations

Personally, while experimenting with the SDXL model, I still find myself favoring the previous SD 1.5 community checkpoints in certain cases. With months of community support behind them, it is relatively easy to find a fine-tuned model that suits specific needs, such as photorealism or more cartoonish styles. For SDXL, however, it is currently hard to find such specialized fine-tunes due to the high computing power requirements. Nevertheless, the SDXL base model appears to perform better than the base models of SD 1.5 or 2.1 in terms of image quality and resolution, and with further optimizations and time, this might change in the near future.

It's worth noting that with the increased model size, some users have reported difficulties running the model on their everyday laptops or PCs, which is unfortunate. I am hopeful that the quantization techniques commonly used in Large Language Models may find their place in this field too.
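Until such techniques mature, diffusers already offers a few memory-saving switches that help on modest hardware; the sketch below assumes the accelerate package is installed for CPU offloading, and trades some speed for a much lower peak memory footprint.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
)

# Keep sub-models on the CPU and move each to the GPU only while it is needed.
pipe.enable_model_cpu_offload()
# Decode the latent image in slices to reduce peak VAE memory usage.
pipe.enable_vae_slicing()

image = pipe(prompt="a cute little robot learning how to paint").images[0]
```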

Additionally, with the change in text encoders, my usual prompts no longer yield satisfactory results on SDXL. Although the developers of SDXL claim that prompting is easier with SDXL, I have yet to find the right approach myself. It may take some time to adjust to the new prompt style, especially for those coming from previous versions.

Conclusion

In our article, we discovered the capabilities of Stable Diffusion XL, a model with the ability to transform plain text descriptions into intricate visual representations. We found that SDXL's open-source nature and its approach to addressing concerns related to black-box models have contributed to its widespread appeal, allowing it to reach a broader audience.

With its increased number of parameters and extra features, SDXL has proven itself to be an "XL" model, boasting heightened complexity compared to its predecessors.

Implementing SDXL in practice has been made easier with the official release of its model weights, freely available for use from Python via Hugging Face's diffusers library, along with the user-friendly ComfyUI GUI option.

While SDXL shows great promise, the journey towards perfection is still ongoing. Some finely-tuned SD 1.5 models continue to outperform SDXL in certain scenarios, thanks to the vibrant community support. However, with active engagement and support, I can see that SDXL will continue to evolve and improve over time.

Nonetheless, it is important to acknowledge that, like its predecessors, SDXL has some limitations. Achieving perfect photorealism, rendering legible text, handling compositional prompts, and accurately generating faces and people are among the areas where the model still needs to improve.

In conclusion, SDXL 1.0 represents a significant leap forward in text-to-image generation, unleashing the creative potential of AI and pushing the boundaries of what's possible. As the AI community continues to collaborate and innovate, we can look forward to witnessing even more astonishing developments in the fascinating world of SD models and beyond.
