A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective
- URL: http://arxiv.org/abs/2509.16499v2
- Date: Wed, 01 Oct 2025 16:21:12 GMT
- Title: A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective
- Authors: Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu
- Abstract summary: Diffusion models have led to an abundance of AI-generated data, raising concerns about model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization.
- Score: 31.227375452793634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse -- a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.
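The entropy-based selection strategy described in the abstract can be sketched in code. The abstract does not specify the entropy measure used, so the sketch below substitutes a k-nearest-neighbor diversity score (a Kozachenko-Leonenko-style estimate) as one plausible stand-in: synthetic samples sitting in dense, near-duplicate regions score low and are dropped before the next training cycle. All function names here are illustrative, not the paper's.

```python
import numpy as np

def knn_entropy_scores(samples, k=3):
    """Diversity score per sample: log-distance to the k-th nearest
    neighbor. Near-duplicated (memorized) samples score low."""
    # pairwise Euclidean distances (fine for a small sketch)
    d = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    kth = np.sort(d, axis=1)[:, k - 1]   # distance to the k-th neighbor
    return np.log(kth + 1e-12)

def select_high_entropy(samples, keep_frac=0.5, k=3):
    """Keep the most diverse synthetic samples for the next training cycle."""
    scores = knn_entropy_scores(samples, k)
    keep = np.argsort(scores)[-int(len(samples) * keep_frac):]
    return samples[keep]

# toy usage: 200 diverse points plus 50 near-duplicates of one point
rng = np.random.default_rng(0)
diverse = rng.normal(size=(200, 2))
dupes = np.tile(rng.normal(size=(1, 2)), (50, 1)) + 1e-3 * rng.normal(size=(50, 2))
data = np.concatenate([diverse, dupes])
kept = select_high_entropy(data, keep_frac=0.5)
```

In this toy run the near-duplicate cluster receives the lowest scores and is filtered out, which is the qualitative behavior an entropy-based filter needs to counter declining training-set entropy.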
Related papers
- Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence [31.751930228965467]
We investigate ways to modify this synthetic retraining process to avoid model collapse. Our key finding is that by injecting information through an external synthetic data verifier, synthetic retraining will not cause model collapse.
arXiv Detail & Related papers (2025-10-18T22:39:39Z)
- ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models [13.096745830570944]
We identify model overconfidence in its self-generated data as a key driver of collapse. We introduce a novel loss function we call Truncated Cross Entropy (TCE). These findings suggest that the design of loss functions provides a simple yet powerful tool for preserving the quality of generative models.
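The summary names Truncated Cross Entropy but not its exact form. One plausible reading, sketched below as an assumption rather than the paper's definition, is a cross entropy that zeroes out positions where the model is already highly confident in the target, damping the overconfidence feedback loop on self-generated data.

```python
import numpy as np

def truncated_cross_entropy(probs, targets, tau=0.95):
    """Illustrative 'truncated' cross entropy (assumed form): standard CE,
    but positions where the model already assigns probability > tau to
    the target contribute zero loss.

    probs:   (n, vocab) predicted distributions
    targets: (n,) integer class labels
    """
    p_target = probs[np.arange(len(targets)), targets]
    ce = -np.log(np.clip(p_target, 1e-12, 1.0))
    mask = p_target <= tau            # truncate overconfident positions
    return (ce * mask).mean()

# usage: the overconfident position drops out of the loss
probs = np.array([[0.99, 0.01],      # confidence above tau, truncated
                  [0.60, 0.40]])     # still contributes
targets = np.array([0, 0])
loss = truncated_cross_entropy(probs, targets, tau=0.95)
```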
arXiv Detail & Related papers (2025-09-10T20:06:51Z)
- A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z)
- Learning by Surprise: Surplexity for Mitigating Model Collapse in Generative AI [1.6545633988217645]
As synthetic content infiltrates the web, generative AI models may be retrained on their own outputs. This leads to model collapse: a progressive loss of performance and diversity across generations. We introduce new measures that characterise collapse directly from a model's next-token probability distributions.
arXiv Detail & Related papers (2024-10-16T08:02:48Z)
- Model Collapse in the Self-Consuming Chain of Diffusion Finetuning: A Novel Perspective from Quantitative Trait Modeling [10.159932782892865]
This paper examines the Chain of Diffusion, where a pretrained text-to-image diffusion model is finetuned on its own generated images. We demonstrate that severe image quality degradation is universal and identify the CFG scale as the key factor driving this model collapse. We propose Reusable Diffusion Finetuning (ReDiFine), a simple yet effective strategy inspired by genetic mutations.
arXiv Detail & Related papers (2024-07-04T13:41:54Z)
- How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse [9.59833542807268]
Model collapse occurs when new models are trained on synthetic data generated from previously trained models.
We show that model collapse cannot be avoided when training solely on synthetic data.
We estimate a maximal amount of synthetic data below which model collapse can eventually be avoided.
arXiv Detail & Related papers (2024-04-07T22:15:13Z)
- Heat Death of Generative Models in Closed-Loop Learning [63.83608300361159]
We study the learning dynamics of generative models that are fed back their own produced content in addition to their original training dataset.
We show that, unless a sufficient amount of external data is introduced at each iteration, any non-trivial temperature leads the model to degenerate.
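The degeneration described above can be reproduced in a self-contained toy loop (a sketch, not the paper's setting): fit a one-dimensional Gaussian to samples drawn from the previous fit, with no external data injected. Because the refit uses the biased MLE variance, the fitted variance contracts toward zero across generations.

```python
import numpy as np

rng = np.random.default_rng(1)

def closed_loop_variances(n=10, generations=300):
    """Fit a 1-D Gaussian to its own samples, generation after generation.
    With no fresh external data, the MLE variance (dividing by n, biased
    low) drives the fitted variance toward zero."""
    mu, var = 0.0, 1.0
    history = []
    for _ in range(generations):
        samples = rng.normal(mu, np.sqrt(var), size=n)
        mu, var = samples.mean(), samples.var()  # ddof=0: biased MLE
        history.append(var)
    return history

vars_ = closed_loop_variances()
```

Running this, the variance collapses by many orders of magnitude over the loop, mirroring in miniature the claim that closed-loop training without external data degenerates.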
arXiv Detail & Related papers (2024-04-02T21:51:39Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Model Collapse Demystified: The Case of Regression [12.115359951879462]
We study the phenomenon of "model collapse" in the era of proliferation of large language and image generation models.
We obtain analytic formulae which quantitatively outline this phenomenon in a broad range of regimes.
We propose a simple strategy based on adaptive regularization to mitigate model collapse.
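The summary does not spell out the adaptive regularization rule, so the sketch below is only one assumed instantiation: closed-form ridge regression in a self-consuming loop, with the penalty grown each generation to damp noise accumulated from synthetic labels. The schedule (`lam0`, `growth`) and function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def self_consuming_ridge(X, y, generations=5, lam0=1e-2, growth=2.0):
    """Retrain on self-generated labels, increasing the ridge penalty each
    generation (illustrative 'adaptive regularization', assumed rule)."""
    rng = np.random.default_rng(0)
    w_hist = []
    for t in range(generations):
        lam = lam0 * growth ** t           # penalty grows with generation
        w = ridge_fit(X, y, lam)
        w_hist.append(w)
        y = X @ w + 0.1 * rng.normal(size=len(y))  # next round: synthetic labels
    return w_hist

# toy usage: recover a known linear model through the loop
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)
ws = self_consuming_ridge(X, y)
```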
arXiv Detail & Related papers (2024-02-12T15:26:01Z)
- A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by e.g. the combination of model, parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z)
- Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables.
We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph.
Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z)
- Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.