
AI Safety

Model Collapse and Data Quality

When models train on their own outputs, the learned distribution narrows, tails disappear, and quality degrades across generations. Why synthetic data feedback loops threaten pretraining data quality and how to mitigate them.

Advanced · Tier 2 · Frontier · Frontier watch · ~45 min

Why This Matters

As LLMs produce an increasing fraction of text on the internet, future pretraining datasets will inevitably contain AI-generated content. If models are trained on the outputs of previous models, the resulting distribution shifts in predictable and harmful ways: variance decreases, tails disappear, minority modes vanish. This is model collapse.

This is not a hypothetical concern. Web crawls from 2024 onward contain substantial AI-generated text. Any pretraining pipeline that ingests web data without careful filtering is at risk.

Mental Model

Imagine a game of telephone where each participant is a language model. The first model learns from real human data. The second model learns from the first model's outputs. The third learns from the second's outputs. At each step, the learned distribution becomes a noisy approximation of the previous one. Rare events (unusual phrasing, minority viewpoints, tail-distribution examples) get progressively smoothed out because the approximation concentrates around the mode.

Formal Setup

Let p_0 be the true data distribution (human-generated text). A model M_0 trained on samples from p_0 learns an approximation \hat{p}_0. Model M_1 is trained on samples from \hat{p}_0, learning \hat{p}_1. In general, M_k is trained on samples from \hat{p}_{k-1}.

Definition

Model Collapse

Model collapse is the progressive degradation of the learned distribution \hat{p}_k as k increases in an iterative training loop where each generation of models trains on the outputs of the previous generation. The distribution narrows, variance decreases, and support shrinks.

Definition

Iterative Retraining

In iterative retraining, generation k of a model is trained on data sampled from generation k-1:

\hat{p}_k = \text{Train}(\{x_i\}_{i=1}^{n_k}), \quad x_i \sim \hat{p}_{k-1}

This models the scenario where AI-generated text progressively replaces human-generated text in training corpora.

Main Theorems

Theorem

Variance Decay Under Iterative Retraining

Statement

Suppose p_0 = \mathcal{N}(\mu, \sigma^2) and each generation fits a Gaussian via MLE on n samples from the previous generation. After k generations, the expected variance of \hat{p}_k satisfies:

\mathbb{E}[\hat{\sigma}_k^2] = \sigma^2 \cdot \left(\frac{n-1}{n}\right)^k

As k \to \infty, \hat{\sigma}_k^2 \to 0. The distribution collapses to a point mass.

Intuition

Each generation estimates the variance from a finite sample, which systematically underestimates the true variance (by the factor (n-1)/n from the MLE bias). This bias compounds across generations. With infinite samples (n \to \infty), each generation would recover the previous distribution exactly and no collapse would occur. Finite sampling is the root cause.

Proof Sketch

The MLE variance estimator from n samples has expectation \frac{n-1}{n} \sigma_{\text{true}}^2. At generation k, the true variance being estimated is \hat{\sigma}_{k-1}^2, so in expectation \hat{\sigma}_k^2 = \frac{n-1}{n} \hat{\sigma}_{k-1}^2. Iterating gives the geometric decay \left(\frac{n-1}{n}\right)^k.
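The geometric decay can be checked numerically. The sketch below (a toy simulation with parameters chosen for runtime, not anything from the paper) refits a Gaussian by MLE on fresh samples from the previous fit and compares the average final variance over many chains against the theorem's prediction:

```python
import random

def retrain_chain(sigma2, n, k, rng):
    """One chain of k generations: refit a Gaussian by MLE on n fresh
    samples drawn from the previous generation's fit."""
    m, v = 0.0, sigma2
    for _ in range(k):
        xs = [rng.gauss(m, v ** 0.5) for _ in range(n)]
        m = sum(xs) / n
        v = sum((x - m) ** 2 for x in xs) / n   # MLE variance: biased low by (n-1)/n
    return v

sigma2, n, k, runs = 100.0, 100, 50, 200
rng = random.Random(0)
mean_final = sum(retrain_chain(sigma2, n, k, rng) for _ in range(runs)) / runs
predicted = sigma2 * ((n - 1) / n) ** k   # theorem's geometric decay, ~60.5 here
print(f"theorem: {predicted:.1f}  simulated mean over {runs} chains: {mean_final:.1f}")
```

A single chain is noisy (the variance estimate itself has variance), which is why the check averages many independent chains; the average tracks the \left(\frac{n-1}{n}\right)^k curve.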

Why It Matters

This is the simplest demonstration that iterative retraining on synthetic data causes systematic quality loss. The Gaussian case is exactly solvable, but the underlying phenomenon (finite-sample estimation errors compounding across generations) applies far more broadly. Shumailov et al. (2024) confirm the same pattern empirically in language models, variational autoencoders, and Gaussian mixtures.

Failure Mode

The Gaussian analysis understates the problem. In higher dimensions and with more complex distributions, the modes of the distribution can disappear entirely (not just shrink), because the model fails to generate enough samples from rare modes for the next generation to learn them.

Proposition

Tails Vanish Under Iterative Retraining

Statement

Let p_0 be a mixture of m Gaussians with weights w_1, \ldots, w_m. Under iterative retraining with n samples per generation, a component with weight w_j is represented in the sample with probability 1 - (1 - w_j)^n. After k generations, the probability that a minority component (small w_j) survives in the training data decays exponentially in k. Components with w_j \ll 1/n vanish within a few generations.

Intuition

If a mixture component has weight 1%, and you draw 100 samples, you expect only 1 sample from that component. In the next generation, the model trained on those samples may assign even less weight to that component. Within a few generations, the component produces zero samples and is permanently lost from the data distribution.
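Both parts of this intuition can be checked with a small simulation (illustrative only; the re-estimated weight below is a stand-in for refitting a full mixture each generation):

```python
import random

def survival_prob(w, n):
    """Chance a weight-w component appears at least once among n samples."""
    return 1 - (1 - w) ** n

p1 = survival_prob(0.01, 100)   # ~0.63: a 1% mode misses a 100-sample draw ~37% of the time

def lost_within(k, w0=0.01, n=100, seed=0):
    """Track the minority weight across generations: each generation re-estimates
    the weight from its sample count; a zero-count generation loses it for good."""
    rng = random.Random(seed)
    w = w0
    for _ in range(k):
        hits = sum(rng.random() < w for _ in range(n))
        if hits == 0:
            return True          # no samples: the next generation cannot learn the mode
        w = hits / n
    return False

loss_rate = sum(lost_within(50, seed=s) for s in range(300)) / 300
print(f"one-generation survival: {p1:.2f}  lost within 50 generations: {loss_rate:.2f}")
```

The per-generation weight follows a neutral drift, so the minority component is almost always absorbed at zero within a few dozen generations, exactly the permanent loss described above.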

Why It Matters

Tail disappearance means that rare but valid text (minority dialects, specialized technical content, unusual creative writing) is systematically removed from the training distribution across generations. The resulting models produce more homogeneous, more "average" text with less diversity.

Failure Mode

This analysis assumes independent sampling at each generation. In practice, deduplication and filtering steps may accelerate tail loss. Conversely, targeted oversampling of rare content can slow it.

Empirical Evidence

Shumailov et al. (2024) demonstrated model collapse empirically across multiple architectures:

  • Language models (OPT-125M): After 5 generations of iterative retraining, perplexity increased and text diversity (measured by self-BLEU) decreased. The models produced increasingly repetitive output.
  • Variational autoencoders: On MNIST, iterative retraining caused the model to lose minority digit classes within 5-10 generations.
  • Gaussian mixtures: The number of recovered modes decreased monotonically with generation count.

The pattern is consistent: iterative retraining degrades quality, reduces diversity, and eliminates tails.

Mitigation Strategies

Data provenance tracking. Label data as human-generated or AI-generated at the point of creation. Maintain metadata throughout the data pipeline. Prioritize human-generated data in pretraining mixtures.

Decontamination. Use classifiers (GPTZero, DetectGPT, watermark detectors) to identify and remove AI-generated content from training corpora. This is imperfect because detection accuracy degrades as models improve.
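A decontamination step slots into a pipeline as a simple filter over documents. The sketch below is a minimal illustration: `ai_score`, `toy_score`, and the threshold are hypothetical stand-ins, and any real deployment would use one of the detectors named above in place of the crude repetition heuristic:

```python
from typing import Callable, Iterable, Iterator

def decontaminate(
    docs: Iterable[str],
    ai_score: Callable[[str], float],   # hypothetical detector: P(doc is AI-generated)
    threshold: float = 0.9,
) -> Iterator[str]:
    """Keep only documents the detector scores below the threshold.
    A lower threshold filters more aggressively but also discards more
    mis-scored human text (false positives)."""
    for doc in docs:
        if ai_score(doc) < threshold:
            yield doc

# Crude stand-in scorer for the example: repetition ratio (illustrative only).
def toy_score(doc: str) -> float:
    words = doc.split()
    return 1.0 - len(set(words)) / len(words) if words else 0.0

kept = list(decontaminate(["the cat sat on the mat", "spam spam spam spam"], toy_score, 0.5))
# kept == ["the cat sat on the mat"]
```

The threshold choice encodes the precision/recall trade-off noted above: as detectors degrade, any fixed threshold either lets synthetic text through or throws away human text.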

Maintaining human data sources. Preserve access to pre-LLM web crawls (Common Crawl snapshots from before 2022). Curate high-quality human-written datasets (books, peer-reviewed papers, pre-LLM Wikipedia). Weight these sources more heavily in the training mixture.

Mixing real and synthetic data: replace vs accumulate. A key distinction separates two data regimes. Shumailov et al. (2024) use replace semantics: each generation discards prior data and trains only on the previous generation's samples. This is the regime where collapse is most severe. Gerstgrasser et al. (2024, arXiv:2404.01413) show that under accumulate semantics, where each generation keeps all historical real data and appends new synthetic data, collapse is prevented entirely: the loss does not diverge across generations. The practical mitigation is to maintain a minimum fraction \alpha of real data in every training batch and, where possible, accumulate rather than replace. Under accumulate semantics, even modest fractions of real data (for example \alpha = 10\%) substantially slow or halt collapse. The attribution matters: Shumailov is the canonical collapse result, Gerstgrasser is the canonical mitigation result.
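The two regimes can be compared directly in the Gaussian toy model. This is a sketch under assumed parameters, not Gerstgrasser et al.'s experiments; the only difference between the two runs is one line, whether the pool is replaced or appended to:

```python
import random

def fit_mle(xs):
    """Gaussian MLE: sample mean and (biased) sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

def simulate(mode, n=40, k=50, seed=0):
    """k generations of n synthetic samples each, starting from N(0, 1) real data."""
    rng = random.Random(seed)
    pool = [rng.gauss(0.0, 1.0) for _ in range(n)]          # generation-0 real data
    m, v = fit_mle(pool)
    for _ in range(k):
        synth = [rng.gauss(m, v ** 0.5) for _ in range(n)]
        pool = synth if mode == "replace" else pool + synth  # the only difference
        m, v = fit_mle(pool)
    return v

runs = 40
mean_replace = sum(simulate("replace", seed=s) for s in range(runs)) / runs
mean_accumulate = sum(simulate("accumulate", seed=s) for s in range(runs)) / runs
print(f"final variance, replace: {mean_replace:.2f}  accumulate: {mean_accumulate:.2f}")
```

Averaged over independent runs, the replace chain's variance decays geometrically while the accumulate chain's variance stays near the true value, matching the replace-versus-accumulate contrast described above.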

Data diversity enforcement. During synthetic data generation, use temperature scaling, nucleus sampling, or explicit diversity objectives to ensure the generated data covers the full distribution, not just the mode.
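Nucleus (top-p) sampling is directly relevant here because its cutoff decides how much tail mass survives generation. A minimal sketch over toy probabilities (not any particular model's API): a low `top_p` truncates exactly the rare tokens this page says collapse destroys, so synthetic-data generation should keep `top_p` near 1:

```python
import random

def nucleus_sample(probs: dict, top_p: float, rng: random.Random) -> str:
    """Sample from the smallest high-probability token set whose cumulative
    mass reaches top_p (nucleus / top-p sampling)."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cum = [], 0.0
    for tok, p in items:
        nucleus.append((tok, p))
        cum += p
        if cum >= top_p:
            break                  # everything rarer than this is truncated
    total = sum(p for _, p in nucleus)
    r = rng.random() * total
    for tok, p in nucleus:
        r -= p
        if r <= 0:
            return tok
    return nucleus[-1][0]          # float-rounding fallback

rng = random.Random(0)
probs = {"the": 0.9, "cat": 0.09, "axolotl": 0.01}
# top_p = 0.5 keeps only the head token; top_p = 1.0 preserves the 1% tail.
head_only = nucleus_sample(probs, 0.5, rng)   # always "the"
```

With `top_p = 0.5` the 1% token can never be emitted, so a next generation trained on this output never sees it; with `top_p = 1.0` it survives at its true rate.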

Watermarking for downstream filtering. Text watermarking embeds a statistical signal into model outputs at sampling time so that AI-generated text can be detected later. Kirchenbauer et al. (2023, arXiv:2301.10226) partition the vocabulary into a pseudo-random green list and red list at each decoding step (seeded by the preceding token) and softly bias sampling toward the green list. The resulting bias is undetectable to readers but statistically detectable with a one-sided z-test on token frequencies. Christ, Gunn, and Zamir (2023, arXiv:2306.09194) strengthen this to cryptographically undetectable watermarks: the watermarked distribution is computationally indistinguishable from the unwatermarked one, yet detection remains possible with a secret key. If watermarking is adopted across major providers, training pipelines can strip watermarked text from web crawls, which directly reduces the synthetic fraction of the data distribution and slows collapse.
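The statistics of green-list detection can be shown without a language model. The sketch below is a toy version of the Kirchenbauer-style scheme under loud simplifications: a uniform "model" over a small integer vocabulary, an unkeyed `random.Random` seed standing in for the keyed hash, and a hard green-forcing probability `delta` standing in for the soft logit bias:

```python
import random

VOCAB = list(range(1000))    # toy vocabulary of integer token ids
GAMMA = 0.5                  # fraction of the vocabulary on the green list

def green_list(prev_token: int) -> set:
    """Pseudo-random green list seeded by the previous token. A real scheme
    seeds a keyed hash; plain random.Random stands in for it here."""
    rng = random.Random(prev_token)
    return set(rng.sample(VOCAB, int(GAMMA * len(VOCAB))))

def generate(length: int, delta: float, seed: int = 0) -> list:
    """Uniform toy 'model': with probability delta, force a green token
    (standing in for the soft logit bias of the real scheme)."""
    rng = random.Random(seed)
    out = [rng.choice(VOCAB)]
    for _ in range(length - 1):
        if rng.random() < delta:
            out.append(rng.choice(sorted(green_list(out[-1]))))
        else:
            out.append(rng.choice(VOCAB))
    return out

def z_score(tokens: list) -> float:
    """One-sided z-test: is the green-token fraction significantly above GAMMA?"""
    t = len(tokens) - 1
    hits = sum(tokens[i] in green_list(tokens[i - 1]) for i in range(1, len(tokens)))
    return (hits - GAMMA * t) / (GAMMA * (1 - GAMMA) * t) ** 0.5

z_marked = z_score(generate(500, delta=0.5))          # strongly green-biased
z_plain = z_score(generate(500, delta=0.0, seed=1))   # unbiased baseline, z ~ N(0, 1)
```

The watermarked sequence produces a z-score far above any reasonable detection threshold, while unwatermarked text stays near zero; this is the one-sided test a filtering pipeline would apply before admitting crawled text.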

Scaling-Law Interactions

Dohmatob, Feng, Yang, Charton, and Kempe (2024, arXiv:2404.05090) show that training on synthetic data changes the shape of neural scaling laws, not just their constants. Standard scaling laws predict that loss decays as a power law in model size and data quantity. Under iterative retraining, the observed loss eventually bends away from the power law and approaches a positive asymptotic floor: adding more compute or more synthetic data stops helping. The authors derive this as a change in the bias-variance decomposition, where the bias floor is set by the cumulative information loss of prior generations. The practical implication is that the scaling frontier depends on data provenance. A pipeline that accumulates synthetic data faster than it accumulates real data will hit a plateau earlier than the clean-data scaling law would suggest, independent of model size.
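The qualitative claim can be made concrete with an assumed functional form (the constants below are illustrative, not the paper's fitted values): contamination adds a positive floor to the clean power law, so past some data scale the clean curve drops below anything the contaminated pipeline can reach.

```python
# Assumed illustrative constants (not the paper's fitted values).
A, ALPHA, FLOOR = 10.0, 0.1, 0.5

def clean_loss(d: float) -> float:
    """Clean-data scaling: pure power law in dataset size d."""
    return A * d ** -ALPHA

def contaminated_loss(d: float) -> float:
    """Synthetic contamination adds a positive floor: more data stops helping."""
    return A * d ** -ALPHA + FLOOR

for d in (1e6, 1e9, 1e12, 1e15):
    print(f"D = {d:.0e}: clean = {clean_loss(d):.3f}, contaminated = {contaminated_loss(d):.3f}")
```

The gap between the two curves is constant in absolute terms but grows without bound relative to the clean loss, which is why the plateau dominates at frontier data scales.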

Common Confusions

Watch Out

Replace versus accumulate: the regime matters more than the mix fraction

Two different data regimes give very different predictions, and the literature sometimes collapses them into one "synthetic data is bad" claim. Under replace (Shumailov et al. 2024, Nature): generation k sees only samples from generation k-1 and discards all prior real data. This is where the \left(\frac{n-1}{n}\right)^k variance decay bites, tails vanish within a few generations, and loss diverges. Under accumulate (Gerstgrasser et al. 2024, arXiv:2404.01413): generation k sees the union of all prior real data plus all synthetic data produced so far. Gerstgrasser et al. show, both analytically for Gaussians and empirically for language and diffusion models, that accumulate keeps the loss bounded across generations and collapse does not occur. The practical implication: a pipeline that discards old data each iteration is at risk, even with a small synthetic fraction; a pipeline that appends is much safer, even with a large synthetic fraction. When you read a "model collapse" claim, ask which regime the experiment used. Bertrand et al. (2023, arXiv:2310.00429) give a complementary stability analysis, showing a phase transition in the real-to-synthetic ratio below which iterative retraining remains stable.

Watch Out

The 10 percent real-data rule is a heuristic for accumulate, not a guarantee

It is often quoted that "mixing 10 percent real data prevents collapse". This is approximately true under accumulate semantics and for the Gaussian-MLE model; it is not a distribution-free theorem. Under replace, a fixed 10 percent real fraction does not prevent collapse in general: the synthetic 90 percent continues to narrow across generations, and what survives in the joint distribution depends on how the fixed human data interacts with the contracting synthetic part. Treat the 10 percent number as a calibration point from Shumailov-style experiments, not as a safe operating threshold for arbitrary data pipelines.

Watch Out

Model collapse is not catastrophic forgetting

Catastrophic forgetting occurs when a model trained on task A loses performance on task A after fine-tuning on task B. Model collapse is a different phenomenon: the model's training data distribution narrows across generations, even when the task stays the same. The cause is iterative retraining on synthetic data, not task switching.

Watch Out

Model collapse does not require the same model

The collapse occurs even when different architectures are used across generations. What matters is that each generation trains on the previous generation's outputs. The distribution narrowing is a property of the data pipeline, not any specific model architecture.

Watch Out

Model collapse is not mode collapse

Model collapse and mode collapse are distinct phenomena that share a misleadingly similar name. Model collapse (Shumailov et al. 2024) is the iterative-retraining distribution degeneration studied here: across multiple generations of models, each trained on the previous generation's outputs, the learned distribution narrows and tails vanish. Mode collapse (Goodfellow 2016, Deep Learning Ch. 20; Salimans et al. 2016, arXiv:1606.03498) is a single-training-run failure mode of GANs, where the generator learns to produce only a subset of the modes of the target distribution because the discriminator cannot penalize lack of diversity. Mode collapse is about one model failing to cover its training distribution. Model collapse is about a chain of models failing to preserve a distribution across generations.

Watch Out

Small amounts of synthetic data do not cause collapse

Using synthetic data to augment a primarily human-generated training set is different from iterative retraining. A single generation of synthetic data mixed with real data does not produce collapse. The problem requires multiple generations where each generation's output becomes the next generation's input.

Summary

  • Iterative retraining on synthetic data causes variance decay at rate (1 - 1/n)^k
  • Tails and minority modes vanish within a few generations
  • The root cause is finite-sample estimation error compounding across generations
  • Mitigation: data provenance, decontamination, preserving human data sources
  • Under accumulate semantics, even 10% real data in the training mix substantially slows collapse; under replace, no fixed fraction is a guarantee
  • This is a systemic risk as AI-generated text saturates the web

Exercises

ExerciseCore

Problem

A Gaussian distribution \mathcal{N}(0, 100) undergoes iterative retraining with n = 1000 samples per generation. After k = 100 generations, what is the expected variance? After how many generations does the expected variance drop below 50?

ExerciseCore

Problem

A mixture source has three components with weights w_1 = 0.6, w_2 = 0.39, w_3 = 0.01. Under iterative retraining with n samples per generation, the minority component survives one generation with probability 1 - (1 - w_3)^n. (a) How large must n be so the minority component survives a single generation with probability at least 0.95? (b) If n = 200 and the probability of survival per generation is roughly constant, what is the expected number of generations before the minority component is lost in at least one sampling step?

ExerciseAdvanced

Problem

A training corpus is a mixture of 95% AI-generated text and 5% human-generated text. You train a model on this corpus, then use that model to generate the AI portion of the next corpus (keeping the 5% human data fixed). Model the AI-generated text as drawn from the model's learned distribution. After 10 generations, qualitatively describe what happens to the overall distribution. Does the 5% human data prevent collapse?

Further directions

  • Information-theoretic analysis: rate-distortion perspective on collapse
  • Model collapse in RLHF and post-training (does preference data collapse too?)
  • Curriculum effects on collapse (does mixing order matter?)
  • Human-AI hybrid data generation pipelines
  • Data provenance at web scale (C4, FineWeb filtering)
  • Empirical case studies: what does collapse look like on real industrial pipelines?

References

Canonical:

  • Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., Anderson, R. "The Curse of Recursion: Training on Generated Data Makes Models Forget" (2024, Nature; published under the title "AI Models Collapse When Trained on Recursively Generated Data").
  • Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Sleight, H., Hughes, J., Korbak, T., Agrawal, R., Pai, D., Gromov, A., Roberts, D., Yang, D., Donoho, D., Koyejo, S. "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data" (2024, arXiv:2404.01413).
  • Goodfellow, I., Bengio, Y., Courville, A. Deep Learning, Ch. 20 "Deep Generative Models" (2016, MIT Press). Canonical treatment of mode collapse in GANs.

Current:

  • Alemohammad, S., et al. "Self-Consuming Generative Models Go MAD" (2024, ICLR).
  • Dohmatob, E., Feng, Y., Yang, P., Charton, F., Kempe, J. "A Tale of Tails: Model Collapse as a Change of Scaling Laws" (2024, arXiv:2404.05090).
  • Bertrand, Q., Bose, A.J., Duplessis, A., Jiralerspong, M., Gidel, G. "On the Stability of Iterative Retraining of Generative Models" (2023, arXiv:2310.00429). Analytic conditions under which iterative retraining stays stable.
  • Briesch et al., "Large Language Models Suffer From Their Own Output" (2023).
  • Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X. "Improved Techniques for Training GANs" (2016, arXiv:1606.03498). Mode collapse and mitigations.
  • Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., Goldstein, T. "A Watermark for Large Language Models" (2023, arXiv:2301.10226).
  • Christ, M., Gunn, S., Zamir, O. "Undetectable Watermarks for Language Models" (2023, arXiv:2306.09194).

Last reviewed: April 24, 2026
