Pre-Training Context is All You Need
Generative Artificial Intelligence and its popular transformer models are advertised everywhere these days, and new models are being released every hour (see the inflation of AI). In this rapidly evolving field, the value these models could bring seems almost endless. Large Language Models (LLMs) like ChatGPT have already made it into every engineer's pile of resources, writers use them to support their articles, and designers create first visuals or seek inspiration from the output of computer vision models.
However, even though the achievements are impressive and generative AI clearly enhances productivity, it is important to recall that modern Machine Learning models (like LLMs or Vision Transformers) do not perform any magic at all (just as ML, or statistical models in general, have never been magical). The remarkable abilities of these models might be perceived as magic-like, and some experts in the field even talk about things like hallucinations of models, but the foundation of every model is still just math and statistical probabilities (sometimes complex, but still math). This leads to the fundamental question: If it is not magic, what really powers these impressive transformer models?

The Foundation of Every Model is Data
As with any model (statistical or ML), it is the training data that has the largest impact on the model's later performance. If you don't have a high volume of quality data reflecting the relationships you would like the model to learn, there is nothing to train on and the resulting model will perform poorly (the famous GIGO principle: Garbage In, Garbage Out). This fundamental principle of data modeling has not changed over the years. Behind every revolutionary new transformer model stands, first of all, just one thing: data. It is the amount, quality, and context of that data that drives the subsequent performance of the model. Recent studies (see further below) support this by showing that the latest generative AI models generalize well when the provided context is part of the training distribution, but poorly when it requires out-of-distribution learning.
In-Distribution vs. Out-Of-Distribution Learning
It is important to keep in mind that a model is nothing more than a huge network, tree, or graph of relationships. What an ML model basically learns is how to transform a given input into a desired output (see Figure 2).
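As a minimal sketch of that idea (a toy illustration of my own, not one of the models discussed here): a "model" is just a parametrized transformation from input to output, and training means adjusting those parameters so the transformation matches the data.

```python
import numpy as np

# Toy dataset: the relationship we would like the model to learn is y = 3x - 2.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=200)
y = 3 * x - 2 + rng.normal(scale=0.3, size=200)

# The "model" is nothing but a transformation: input -> w * input + b.
# Training updates the parameters w and b so the transformation fits the data.
w, b = np.polyfit(x, y, deg=1)

print(round(w, 2), round(b, 2))   # close to 3 and -2
print(w * 1.5 + b)                # prediction for a new input, roughly 2.5
```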

When the model is trained (or in other words: when those relationships are updated), the context of the input and how informative it is about the output will define what the model is good at. Similar to humans being good at responding to questions in their native language, ML models are good at responding to input data they have seen a lot. This is called in-distribution learning. If, during training, the model has been provided with large amounts of rich context, it can rely on this acquired knowledge later, and the resulting predictions show accurate performance.
Out-of-distribution learning, however, describes the situation where a model is supposed to predict based on context it has not seen before. You can picture a human who never learned Norwegian suddenly being asked to respond to a question in Norwegian. Please inspect Figure 3 for an overview of in- and out-of-distribution learning.
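A toy illustration of the difference (my own sketch, not taken from the studies referenced below): fit a model on inputs from a narrow range and compare its error on inputs inside that range (in-distribution) versus far outside it (out-of-distribution).

```python
import numpy as np

rng = np.random.default_rng(1)

def true_signal(x):
    # The (unknown) relationship the model is supposed to capture.
    return np.sin(x)

# Training data only covers x in [0, 2*pi] -- this is the training distribution.
x_train = rng.uniform(0, 2 * np.pi, size=500)
y_train = true_signal(x_train) + rng.normal(scale=0.05, size=500)

# A flexible model: a degree-7 polynomial fit to the training data.
model = np.poly1d(np.polyfit(x_train, y_train, deg=7))

# In-distribution inputs (seen range) vs. out-of-distribution inputs (unseen range).
x_in = np.linspace(0, 2 * np.pi, 100)
x_out = np.linspace(4 * np.pi, 6 * np.pi, 100)

mse_in = np.mean((model(x_in) - true_signal(x_in)) ** 2)
mse_out = np.mean((model(x_out) - true_signal(x_out)) ** 2)

print(f"in-distribution MSE:     {mse_in:.4f}")   # small
print(f"out-of-distribution MSE: {mse_out:.2e}")  # enormous -- the model extrapolates poorly
```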

The impressive performance of modern LLMs and other ML models comes from the vast volume and breadth of context in the original training data. Due to this extensive pre-training, the range of questions that fall inside in-distribution learning is huge. That allows a model to have answers to a wide variety of questions, which might appear to a user as magic or human-level intelligence, but it is not. Similarly, a wrong or unexpected answer by the model is not a true hallucination either; it basically highlights context gaps in the original training data that force the model into out-of-distribution learning. In general, machine learning models are very limited in their out-of-distribution capabilities, which is why foundation models require such extensive pre-training.
The Power of Pretraining in Language Models
In a recent paper by Google DeepMind members, the authors strengthen the argument that the in-context learning performance of modern LLMs is mostly derived from their pre-training distribution. The paper "Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models" by Steve Yadlowsky, Lyric Doshi, and Nilesh Tripuraneni (2023) focuses on how modern transformer models acquire their impressive in-context learning abilities (their ability to produce answers for almost any context prompted to them).
The findings are very insightful. When transformer models are pre-trained on data covering a wide range of contexts, they demonstrate impressive performance in learning new tasks that fall within the pre-training context. This capability is near-optimal, showcasing a high degree of generalization and adaptability within the training distribution. However, when these models encounter context outside of their pre-training domain, performance is limited and failures occur. This reveals reduced generalization and clear limitations for out-of-distribution contexts.
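To give a feel for the kind of setup studied (a simplified sketch of my own, not the authors' code; the function classes, dimensions, and prompt sizes below are illustrative assumptions): the transformer is pre-trained on in-context prompts drawn from a mixture of function classes and is then probed with prompts from a class inside that mixture versus one outside of it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simplified stand-ins for the function classes that make up the pre-training mixture.
def sample_linear(x):
    w = rng.normal(size=x.shape[-1])
    return x @ w

def sample_sinusoid(x):
    freq, phase = rng.uniform(0.5, 2.0), rng.uniform(0.0, np.pi)
    return np.sin(freq * x.sum(axis=-1) + phase)

def held_out_class(x):
    # A function class the model never sees during pre-training.
    return np.prod(x, axis=-1)

PRETRAIN_MIXTURE = [sample_linear, sample_sinusoid]

def make_prompt(func, n_points=32, dim=4):
    """One in-context 'prompt': a sequence of (x, f(x)) pairs from a single function."""
    x = rng.normal(size=(n_points, dim))
    return x, func(x)

# Pre-training prompts are sampled only from the mixture ...
pretrain_prompts = [
    make_prompt(PRETRAIN_MIXTURE[rng.integers(len(PRETRAIN_MIXTURE))])
    for _ in range(1_000)
]

# ... while evaluation prompts can come from inside or outside that mixture.
eval_in_distribution = make_prompt(PRETRAIN_MIXTURE[0])
eval_out_of_distribution = make_prompt(held_out_class)
```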
Vision Transformers: A Case Study in Scale
In another study (also by Google DeepMind in 2023) with the title "ConvNets Match Vision Transformers at Scale", the authors Samuel L. Smith, Andrew Brock, Leonard Berrada, and Soham De challenge a widespread belief in computer vision that, at scale, modern Vision Transformer models outperform traditional models like Convolutional Neural Networks (CNNs). The study trains both CNNs and Vision Transformers with comparable compute budgets and compares their performance.
The results indicate a scaling law between the compute budget used for pretraining and the subsequent performance. After fine-tuning on ImageNet, the pre-trained CNNs matched the performance of Vision Transformers at comparable budgets.
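To picture what such a scaling law means (using made-up constants, not the values fitted in the paper), validation error falling as a power of the pre-training compute looks like this:

```python
import numpy as np

# A hypothetical power-law: error ~ a * compute^(-b).
# The constants a and b below are invented for illustration only.
a, b = 10.7, 0.068

compute_budgets = np.array([1e21, 1e22, 1e23, 1e24])  # pre-training FLOPs (illustrative)
errors = a * compute_budgets ** (-b)

for compute, error in zip(compute_budgets, errors):
    print(f"compute {compute:.0e} FLOPs -> error {error:.3f}")
# Each 10x increase in compute shaves the error by a constant factor,
# i.e. the relationship is a straight line on a log-log plot.
```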
Summary
Together, these two studies paint an interesting picture of the impressive performance of modern transformer models. First, the performance is not just driven by the model architecture; it is driven even more by the amount of pre-training conducted. Second, when the pre-training context covers a wide range, the resulting model will also show a wide range of in-context learning capabilities.
These studies underscore the critical principle that the volume, quality, and context of the training data are the most essential part of any foundational ML model. Without knowing the context covered by the pre-training, it is hard to determine upfront the areas in which a model will perform well. Benchmark tests can help indicate potential context gaps. Those tests do not showcase how well a model performs in general; they mostly showcase which context has been part of the model's training distribution.
In conclusion, as the age of AI unfolds and the number of Data Scientists and Engineers developing ML models keeps increasing, it becomes even more evident that pre-training with a wide range of contexts isn't just a part of the process; in many ways, it's all you need.
All images, unless otherwise noted, are by the author.
Please check out my profile page, follow me, or subscribe to my email list if you would like to know what I write about or if you want to be updated on new stories.