Can Generative AI Lead to AI Collapse?
|LLM|GENERATIVE AI|MODEL COLLAPSE|

"Civilizations die from suicide, not by murder." – Arnold Toynbee
Large language models (LLMs) are generally trained in an unsupervised manner on a huge amount of text obtained by crawling the Internet. Until now this text has been written by humans; that may soon no longer be the case.
LLMs are data-hungry by definition, and the datasets used to train them keep getting bigger. According to the scaling law [2], improving performance requires increasing both the number of parameters and the number of training tokens (with the latter often considered the more important factor).
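For reference, the joint scaling law fitted by Kaplan et al. [2] can be written as below; the exponents are the approximate values reported in the paper and should be read as indicative rather than exact.

```latex
% Cross-entropy loss L as a function of parameters N and training tokens D,
% as fitted in Kaplan et al. [2]; N_c, D_c, alpha_N, alpha_D are constants.
L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D},
\qquad \alpha_N \approx 0.076, \quad \alpha_D \approx 0.095
```

The loss keeps falling only if both N and D grow together, which is why the appetite for fresh training tokens keeps increasing.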
These datasets consist of data produced by humans, but several studies show that this is a limited resource: humans do not produce new text at the rate at which LLM training now consumes it. A recently published study argues that human-generated data cannot sustain this scaling beyond the current decade [3].

With the advent of ChatGPT and open-source models, the amount of text generated by artificial intelligence is growing. For example, a recently published study [1] shows that, thanks to the availability of low-cost machine translation (MT), content on the Web is often quickly translated into many languages with MT algorithms.
Machine generated, multi-way parallel translations not only dominate the total amount of translated content on the web in lower resource languages where MT is available, it also constitutes a large fraction of the total web content in those languages. – source
However, this leads to several problems:
- This machine-translated content shows clear biases and a skewed topic distribution (its low quality suggests it is produced mainly to generate ad revenue).
- The more languages a text is translated into, the lower its average quality.

The amount of text produced by AI is increasing in every domain (the Web, scientific articles, student essays) and is becoming increasingly difficult to identify [4–6]. If future models are trained on text scraped from the Web, they will inevitably be trained on data produced by their own predecessors.
What happens when a model is trained using AI-generated text? What happens if most of the text is produced by ChatGPT?
According to a recent article published in Nature, this leads to model collapse [7]. Model collapse is a degenerative process in which a model's performance progressively degrades until its outputs become useless. From a statistical point of view, it has been described in two stages (a toy simulation after this list illustrates them):
- Early model collapse, where the model begins to lose information about the tails of the distribution.
- Late model collapse, where the model converges to a distribution that bears little resemblance to the original one and thus no longer produces anything useful.
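To build intuition for these two stages, here is a minimal toy simulation (my own sketch, not the paper's experiment): a one-dimensional Gaussian is repeatedly re-estimated from a finite sample drawn from the previous generation's estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 50        # each generation only sees a finite sample
n_generations = 2000

# Generation 0: the "real" data come from a standard Gaussian.
mu, sigma = 0.0, 1.0

for gen in range(1, n_generations + 1):
    # Draw a finite sample from the previous generation's model of the data...
    sample = rng.normal(mu, sigma, size=n_samples)
    # ...and fit this generation's "model" (here just a mean and a std) on it.
    mu, sigma = sample.mean(), sample.std()
    if gen % 400 == 0:
        print(f"generation {gen:4d}: mu = {mu:+.4f}, sigma = {sigma:.4f}")
```

Because every re-estimation uses only a finite sample, the fitted standard deviation shrinks over the generations: the tails (the rare events) are the first information to go (early collapse), and after enough iterations the estimated distribution bears little resemblance to the original one (late collapse). This mirrors the single-Gaussian analysis in [7], not the LLM experiments themselves.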
Earlier work showed that a model cannot be sustained in a self-training loop (after a first iteration on real data, each subsequent generation is trained on self-generated data): training on data generated by the model itself leads the system to collapse.

As described above, the model first begins to deviate from the training data by forgetting elements of the original data and underrepresented classes (early collapse), and then becomes unable to produce meaningful data at all (late collapse).
Studies have thus shown cases in which continual learning from generated data (or from data poisoned with generated data) led a model to collapse, and some authors have warned that the explosion of generated data on the Internet may lead to exactly this outcome:
Model collapse warns that democratizing access to generative models runs the risk of polluting the very data necessary to train future iterations of generative models. – source, [9]
Until now, however, we have had neither a rigorous description of this phenomenon for language models nor a clear account of its causes. According to this study [7], three sources of error cause collapse when AI-generated data enter the training set:
- Statistical approximation error. Each generation is trained on a finite sample of data; this error would vanish only if the number of samples tended to infinity. With finite samples there is a non-zero probability that information (especially about rare events) is lost at every re-sampling step.
- Functional expressivity error. A neural network such as a transformer has limited expressivity, so it can only approximate the original distribution, introducing some error.
- Functional approximation error. This error arises from the learning procedure itself, for example the structural biases of stochastic gradient descent.
Each of these alone can cause model collapse, and their effects compound over generations (the toy example below illustrates the first).
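A quick numerical illustration of the statistical approximation error (a deliberately simplified sketch, not the paper's setup): with a finite sample, a rare event is easily missed altogether, and a model re-fitted on that sample then assigns it zero probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy vocabulary in which one "fact" (token 3) is rare: probability 0.1%.
true_probs = np.array([0.60, 0.30, 0.099, 0.001])
n_samples = 500  # size of the finite dataset the next generation is trained on

# Probability that the rare token never shows up in the finite sample.
p_missing = (1.0 - true_probs[-1]) ** n_samples
print(f"P(rare token absent from {n_samples} samples) = {p_missing:.2f}")  # ~0.61

# One concrete draw: re-estimate the distribution from the finite sample.
counts = np.bincount(rng.choice(4, size=n_samples, p=true_probs), minlength=4)
print("re-estimated probabilities:", counts / n_samples)
# Whenever the rare token is missing from the sample, the re-estimated
# distribution gives it probability 0; no later generation trained on this
# data can recover it. Repeated over generations, the tails erode.
```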
In this paper [7], the authors take a pre-trained model and fine-tune it on a dataset. This is a common way of using an LLM (especially since training a model from scratch is too expensive). What the authors test is what happens when this fine-tuning dataset is generated by another fine-tuned model. Taking a model from Hugging Face, they fine-tuned it on the wikitext2 dataset, evaluated it on the test set, and then used it to generate an artificial dataset. Each subsequent generation of the model was then trained on the artificial dataset produced by its predecessor (a simplified sketch of this loop follows).
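Schematically, the recursive setup looks like the sketch below. This is a hedged reconstruction of the protocol, not the authors' code: I assume facebook/opt-125m as a stand-in for the small pretrained model and use the standard Hugging Face Trainer; the actual hyperparameters and generation settings in [7] differ.

```python
from datasets import Dataset, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "facebook/opt-125m"  # assumed stand-in for the paper's base model
tokenizer = AutoTokenizer.from_pretrained(BASE)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=64)

def finetune(train_texts, generation):
    """Fine-tune a fresh copy of the base model on the given texts."""
    model = AutoModelForCausalLM.from_pretrained(BASE)
    ds = Dataset.from_dict({"text": train_texts}).map(
        tokenize, batched=True, remove_columns=["text"])
    args = TrainingArguments(output_dir=f"gen_{generation}",
                             num_train_epochs=5,
                             per_device_train_batch_size=8,
                             report_to="none")
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=collator).train()
    return model

def generate_dataset(model, prompts, n_new_tokens=64):
    """Use the current generation to produce the next generation's data."""
    texts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=n_new_tokens,
                             do_sample=True, top_k=50)
        texts.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return texts

# Generation 0 is trained on real data (wikitext2); every later generation is
# trained only on text produced by its predecessor.
real = [t for t in load_dataset("wikitext", "wikitext-2-raw-v1",
                                split="train[:1%]")["text"] if t.strip()]
data = real
for gen in range(3):                        # the paper runs more cycles
    model = finetune(data, gen)
    prompts = [t[:40] for t in real[:200]]  # seed prompts from original text
    data = generate_dataset(model, prompts)
```

Evaluating each generation on the original wikitext2 test split is what reveals the progressive degradation.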

Training each generation for five epochs, the model's performance gradually deteriorates. Over the generations, a long tail appears among the generated examples, the product of errors introduced by examples generated by earlier models.

The authors note that keeping a certain percentage of data from the original dataset reduces this degeneration. Models trained on generated data can still learn part of the original task, but with larger errors (as indicated by increased perplexity). For the authors, the model begins to collapse as low-perplexity samples accumulate over the generations (a compounding effect). As the cycle continues, this effect eventually leads to the complete collapse of the model.
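Perplexity here is judged with the original (generation-0) model: rising perplexity on real text, together with an accumulation of samples the original model finds overly likely, is the signature of collapse. Below is a minimal way to compute the perplexity of a string under a causal LM, again assuming the Hugging Face stack from the sketch above (a simplified illustration, not the authors' evaluation code).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-125m"  # assumed stand-in, as in the previous sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the model: exp of the mean token loss."""
    enc = tokenizer(text, return_tensors="pt")
    # With labels equal to input_ids, the model returns the mean cross-entropy
    # of its next-token predictions; exponentiating gives the perplexity.
    loss = model(**enc, labels=enc["input_ids"]).loss
    return float(torch.exp(loss))

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

Comparing the perplexity that the generation-0 model assigns to samples from each later generation, and each generation's perplexity on the original test set, reproduces in spirit the measurements reported in [7].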

On inspection, later generations increasingly produce the examples that the original model would have generated with the highest likelihood. This is consistent with how models learn during training: knowledge that is not repeated fades, starting with the rarest. After an example is seen, the model's knowledge of it peaks and then decays if it is not revisited [10]. Continuing to train on generated data, the model therefore loses rare knowledge first and ends up producing only the most likely examples.

An LLM Student's Handbook: Mastering the Art of Learning and Retaining Knowledge
When a model is trained on a dataset containing AI-generated content, it learns to generate only the best-known concepts, phrases, and tones, while forgetting the ideas and concepts that are less common in the data. In the long run, this leads the model to collapse.
What does model collapse imply for the LLMs of the future?
Long-term poisoning attacks on language models are not new. For example, we saw the creation of click, content and troll farms, a form of human 'language models', whose job is to misguide social networks and search algorithms. – source
For the time being, the creation of this kind of content has primarily affected search engines. Most of it is generated to rank highly in search results and monetize ad impressions. Google has tried to limit the phenomenon by down-ranking these sites in its algorithms, but this does not solve the problem, as new methods keep being found to evade such countermeasures.
The datasets used to train LLMs are typically collected automatically, and much of this generated content may also appear on reputable sites, which means it could enter future training sets in large quantities. Model collapse affects not only performance but also the fairness of the resulting models: because a model quickly forgets underrepresented knowledge (even before any marked effect on overall performance appears), minorities and marginalized groups are impacted first.
Watermarking is not a solution either. First, watermarks can be removed (this has been shown for generated images [11]). Second, detectors of generated text are not very accurate and can easily be fooled. Third, companies are unlikely to share the details of their watermarking schemes (so as not to make it easier for competitors to train a model). Finally, with open-source models, much of the generated text will carry no watermark anyway.

Companies that trained models, or preserved data, before the flood of generated text have an advantage over the competition. In general, data quality is critical, and data generated by real people will be a great asset for those who own it. Alternatively, it would take a coordinated effort to track the origin of text on the Web.
What are your thoughts on this? Let me know in the comments
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository, which contains weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to Machine Learning, artificial intelligence, and more.
GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…
or you may be interested in one of my recent articles:
Beyond Human Feedback: How to Teach a Genial AI Student
Expanding Language, Expanding Thought: Vocabulary Size in LLM Scaling
References
Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.
- Thompson, 2024, A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism, link
- Kaplan, 2020, Scaling Laws for Neural Language Models, link
- Villalobos, 2024, Will we run out of data? Limits of LLM scaling based on human-generated data, link
- Wired, 2024, Students Are Likely Writing Millions of Papers With AI, link
- Akram, 2023, An Empirical Study of AI Generated Text Detection Tools, link
- Lu, 2023, Large Language Models can be Guided to Evade AI-Generated Text Detection, link
- Shumailov, 2024, AI models collapse when trained on recursively generated data, link
- Martinez, 2023, Combining Generative Artificial Intelligence (AI) and the Internet: Heading towards Evolution or Degradation? link
- Gerstgrasser, 2024, Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data, link
- Chang, 2024, How Do Large Language Models Acquire Factual Knowledge During Pretraining? link
- Zhao, 2023, Invisible Image Watermarks Are Provably Removable Using Generative AI, link