Related papers: How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse

URL: http://arxiv.org/abs/2404.05090v1
Date: Sun, 7 Apr 2024 22:15:13 GMT
Title: How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse
Authors: Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah
Abstract summary: Model collapse occurs when new models are trained on synthetic data generated from previously trained models. We show that model collapse cannot be avoided when training solely on synthetic data. We estimate a maximal amount of synthetic data below which model collapse can eventually be avoided.
Score: 9.59833542807268
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The phenomenon of model collapse, introduced in (Shumailov et al., 2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in language models, we consider in this paper a statistical model that allows us to characterize the impact of various recursive training scenarios. Specifically, we demonstrate that model collapse cannot be avoided when training solely on synthetic data. However, when mixing both real and synthetic data, we provide an estimate of a maximal amount of synthetic data below which model collapse can eventually be avoided. Our theoretical conclusions are further supported by empirical validations.
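The recursive loop described in the abstract can be illustrated with a toy simulation. The sketch below is not the statistical model analysed in the paper; it assumes a plain categorical distribution over a small vocabulary, refitted by empirical frequencies at every generation. All names (`zipf_probs`, `fit_mle`, `recursive_training`, `real_fraction`) are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def zipf_probs(vocab_size):
    """Zipf-like token probabilities (an assumed stand-in for a heavy-tailed corpus)."""
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def fit_mle(samples, vocab_size):
    """Empirical (maximum-likelihood) estimate of a categorical distribution."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(token, 0) / n for token in range(vocab_size)]


def recursive_training(true_probs, n_samples, n_generations, real_fraction=0.0, seed=0):
    """Return the support size (tokens with nonzero probability) at each generation.

    real_fraction is the share of each generation's training set drawn from the
    real distribution; 0.0 means training solely on synthetic data.
    """
    rng = random.Random(seed)
    vocab = list(range(len(true_probs)))
    n_real = int(round(real_fraction * n_samples))
    n_synth = n_samples - n_real

    # Generation 0 is fitted on real data only.
    model = fit_mle(rng.choices(vocab, weights=true_probs, k=n_samples), len(vocab))
    support_sizes = [sum(p > 0 for p in model)]

    for _ in range(n_generations):
        # Synthetic data comes from the previous generation's model.
        data = rng.choices(vocab, weights=model, k=n_synth) if n_synth else []
        if n_real:
            data += rng.choices(vocab, weights=true_probs, k=n_real)
        model = fit_mle(data, len(vocab))
        support_sizes.append(sum(p > 0 for p in model))
    return support_sizes


if __name__ == "__main__":
    probs = zipf_probs(50)
    print("synthetic only:", recursive_training(probs, 200, 100)[::20])
    print("half real data:", recursive_training(probs, 200, 100, real_fraction=0.5)[::20])
```

In the purely synthetic setting, a token that receives zero count in one generation cannot be sampled in the next, so the support size can only stay the same or shrink; this is the "disappearing tails" effect in miniature. With `real_fraction > 0`, rare tokens can be reintroduced from the real distribution, which is the qualitative distinction the abstract draws between the two scenarios.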

Related papers

Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence [31.751930228965467]
We investigate ways to modify this synthetic retraining process to avoid model collapse. Our key finding is that by injecting information through an external synthetic data verifier, synthetic retraining will not cause model collapse.
arXiv Detail & Related papers (2025-10-18T22:39:39Z)
When Models Don't Collapse: On the Consistency of Iterative MLE [34.99810116340191]
We study model collapse for maximum likelihood estimation (MLE). We establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. We prove that some assumptions (beyond MLE consistency) are indeed necessary.
arXiv Detail & Related papers (2025-05-25T08:50:46Z)
A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z)
Golden Ratio Weighting Prevents Model Collapse [9.087950471621653]
Recent studies identified a phenomenon in generative model training known as model collapse. We investigate this phenomenon theoretically within a novel framework, where generative models are iteratively trained on a combination of newly collected real data and synthetic data.
arXiv Detail & Related papers (2025-02-25T10:15:16Z)
Rate of Model Collapse in Recursive Training [13.722324504719282]
We ask how fast model collapse occurs for some well-studied distribution families under maximum likelihood (ML or near ML) estimation. Surprisingly, even for fundamental distributions such as discrete and Gaussian distributions, the exact rate of model collapse is unknown. Our results show that for discrete distributions, the time to forget a word is approximately linearly dependent on the number of times it occurred in the original corpus.
arXiv Detail & Related papers (2024-12-23T15:21:50Z)
How to Synthesize Text Data without Model Collapse? [37.219627817995054]
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. We propose token editing on human-produced data to obtain semi-synthetic data.
arXiv Detail & Related papers (2024-12-19T09:43:39Z)
Universality of the $π^2/6$ Pathway in Avoiding Model Collapse [0.0]
We demonstrate the universality of the pi-squared-over-6 augment risk bound across a large family of canonical statistical models. We provide a framework that is able to accommodate a large variety of augment processes.
arXiv Detail & Related papers (2024-10-30T08:44:10Z)
Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences [20.629333587044012]
We study the impact of data curation on iterated retraining of generative models. We prove that, if the data is curated according to a reward model, the expected reward of the iterative retraining procedure is maximized.
arXiv Detail & Related papers (2024-06-12T21:28:28Z)
Heat Death of Generative Models in Closed-Loop Learning [63.83608300361159]
We study the learning dynamics of generative models that are fed back their own produced content in addition to their original training dataset. We show that, unless a sufficient amount of external data is introduced at each iteration, any non-trivial temperature leads the model to degenerate.
arXiv Detail & Related papers (2024-04-02T21:51:39Z)
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data [49.73114504515852]
We show that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse. We demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse.
arXiv Detail & Related papers (2024-04-01T18:31:24Z)
Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models. We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective. Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
Synthesizing Irreproducibility in Deep Networks [2.28438857884398]
Modern-day deep networks suffer from irreproducibility (also referred to as nondeterminism or underspecification). We show that even with a single nonlinearity and for very simple data and models, irreproducibility occurs. Model complexity and the choice of nonlinearity also play significant roles in making deep models irreproducible.
arXiv Detail & Related papers (2021-02-21T21:51:28Z)
Variational Bayesian Unlearning [54.26984662139516]
We study the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased. We show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief. In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging.
arXiv Detail & Related papers (2020-10-24T11:53:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.