From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources
- URL: http://arxiv.org/abs/2602.10531v2
- Date: Tue, 17 Feb 2026 23:36:41 GMT
- Title: From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources
- Authors: Soham Bakshi, Sunrit Chakraborty
- Abstract summary: This paper looks at the problem of model collapse from a statistical viewpoint. We consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. With a non-trivial mixture weight of the true distribution, even if it decays over time, simply training the model in a contamination-agnostic manner can avoid collapse.
- Score: 2.8647133890966994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a statistical viewpoint, illustrating that one can actually hope for improvement when models are trained on data contaminated with synthetic samples, as long as there is some amount of fresh information from the true target distribution. In particular, we consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token prediction language model, capturing how the interplay between the mixture weights and the sample size controls the overall long-term performance. With non-trivial mixture weight of the true distribution, even if it decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and even recover the true target distribution under certain conditions. Simulation studies support our findings and also show that such behavior is more general for other classes of models.
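The mixture setup described in the abstract can be illustrated with a small toy simulation. The sketch below is not the paper's model or experiment: it assumes a context-free categorical distribution over five tokens as a stand-in for a next-token predictor, a fixed mixture weight `alpha` for fresh real data, and a fixed per-generation sample size `n`. The function name `simulate_iterative_training` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_iterative_training(p_true, alpha, n, generations, seed=0):
    """Toy contamination-agnostic retraining loop (illustrative assumption).

    p_true      : true token distribution (1-D array summing to 1)
    alpha       : mixture weight of fresh real data in each generation
    n           : number of samples drawn per generation
    generations : number of retraining rounds
    Returns the final estimate and the total variation (TV) distance
    to p_true after every generation.
    """
    rng = np.random.default_rng(seed)
    p_true = np.asarray(p_true, dtype=float)

    # Generation 0 is fitted on real data only.
    q = rng.multinomial(n, p_true) / n
    tv = [0.5 * np.abs(q - p_true).sum()]

    for _ in range(generations):
        # Each training sample comes from the truth with probability alpha,
        # otherwise from the previous model; n i.i.d. draws from that
        # mixture are one multinomial draw with the mixed probabilities.
        mix = alpha * p_true + (1.0 - alpha) * q
        mix = mix / mix.sum()  # guard against floating-point drift
        # Contamination-agnostic fit: plain empirical frequencies (the MLE),
        # with no attempt to tell real samples from synthetic ones.
        q = rng.multinomial(n, mix) / n
        tv.append(0.5 * np.abs(q - p_true).sum())

    return q, np.array(tv)


p_true = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

# alpha = 0: purely synthetic retraining.
q_pure, tv_pure = simulate_iterative_training(p_true, alpha=0.0, n=50, generations=2000)
# alpha = 0.2: one fifth of every training set is fresh real data (on average).
q_mixed, tv_mixed = simulate_iterative_training(p_true, alpha=0.2, n=50, generations=2000)

print("support size, synthetic only:", np.count_nonzero(q_pure))
print("final TV, synthetic only:", tv_pure[-1])
print("mean TV over last 500 generations, alpha=0.2:", tv_mixed[-500:].mean())
```

In this toy setting the purely synthetic chain behaves like a random walk that is eventually absorbed at a single token, while any positive `alpha` keeps every token's sampling probability above zero. The abstract's stronger claim, recovery of the true distribution, depends on appropriate sample sizes across generations, which this fixed-`n` sketch does not attempt to reproduce.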
Related papers
- Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study [27.894241484593735]
Recursively training on synthetic data has been observed to significantly degrade performance in a wide range of tasks. We theoretically analyze this phenomenon in the setting of score-based diffusion models.
arXiv Detail & Related papers (2026-02-18T16:56:36Z)
- Golden Ratio Weighting Prevents Model Collapse [7.512957145774808]
We develop an optimal training strategy for integrating real and synthetic data. We characterize the impact of the mixing proportion and weighting scheme of synthetic data on the final model's performance. In some cases, the optimal weight assigned to real data corresponds to the reciprocal of the golden ratio.
arXiv Detail & Related papers (2025-02-25T10:15:16Z)
- Provable Statistical Rates for Consistency Diffusion Models [87.28777947976573]
Despite the state-of-the-art performance, diffusion models are known for their slow sample generation due to the extensive number of steps involved.
This paper contributes towards the first statistical theory for consistency models, formulating their training as a distribution discrepancy minimization problem.
arXiv Detail & Related papers (2024-06-23T20:34:18Z)
- Ablation Based Counterfactuals [7.481286710933861]
Ablation Based Counterfactuals (ABC) is a method of performing counterfactual analysis that relies on model ablation rather than model retraining.
We demonstrate how we can construct a model like this using an ensemble of diffusion models.
We then use this model to study the limits of training data attribution by enumerating full counterfactual landscapes.
arXiv Detail & Related papers (2024-06-12T06:22:51Z)
- How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse [9.59833542807268]
Model collapse occurs when new models are trained on synthetic data generated from previously trained models.
We show that model collapse cannot be avoided when training solely on synthetic data.
We estimate a maximal amount of synthetic data below which model collapse can eventually be avoided.
arXiv Detail & Related papers (2024-04-07T22:15:13Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Class-Balancing Diffusion Models [57.38599989220613]
Class-Balancing Diffusion Models (CBDM) are trained with a distribution adjustment regularizer as a solution.
Our method benchmarked the generation results on CIFAR100/CIFAR100LT dataset and shows outstanding performance on the downstream recognition task.
arXiv Detail & Related papers (2023-04-30T20:00:14Z)
- One for More: Selecting Generalizable Samples for Generalizable ReID Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function.
Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
- Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating at large sample space encountered in text generation.
Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z)