A Probabilistic Perspective on Model Collapse
- URL: http://arxiv.org/abs/2505.13947v2
- Date: Wed, 21 May 2025 20:43:02 GMT
- Title: A Probabilistic Perspective on Model Collapse
- Authors: Shirong Xu, Hengzhi He, Guang Cheng
- Abstract summary: This paper aims to characterize the conditions under which model collapse occurs and, crucially, how it can be mitigated. Under mild conditions, we rigorously show that progressively increasing the sample size at each training step is necessary to prevent model collapse. We also investigate the probability that training on synthetic data yields models that outperform those trained solely on real data.
- Score: 9.087950471621653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, model collapse has become a critical issue in language model training, making it essential to understand the underlying mechanisms driving this phenomenon. In this paper, we investigate recursive parametric model training from a probabilistic perspective, aiming to characterize the conditions under which model collapse occurs and, crucially, how it can be mitigated. We conceptualize the recursive training process as a random walk of the model estimate, highlighting how the sample size influences the step size and how the estimation procedure determines the direction and potential bias of the random walk. Under mild conditions, we rigorously show that progressively increasing the sample size at each training step is necessary to prevent model collapse. In particular, when the estimation is unbiased, the required growth rate follows a superlinear pattern. This rate needs to be accelerated even further in the presence of substantial estimation bias. Building on this probabilistic framework, we also investigate the probability that recursive training on synthetic data yields models that outperform those trained solely on real data. Moreover, we extend these results to general parametric model families in an asymptotic regime. Finally, we validate our theoretical results through extensive simulations and a real-world dataset.
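The random-walk picture above is easy to reproduce in a toy setting. The sketch below is not the authors' code; the 1-D Gaussian model, the base sample size of 25, and the n_t = 25·t² schedule are illustrative choices. It recursively refits a Gaussian to data drawn from the previous fit: with a constant sample size the fitted standard deviation drifts toward zero, while a superlinearly growing sample size, in line with the growth condition the abstract describes, keeps the estimates stable.

```python
# Minimal sketch of recursive parametric training, assuming a 1-D Gaussian
# toy model (not the paper's setup). Each generation refits (mu, sigma) to
# samples drawn from the previous generation's fit.
import numpy as np

rng = np.random.default_rng(0)

def recursive_fit(n_schedule, mu0=0.0, sigma0=1.0, generations=100):
    mu, sigma = mu0, sigma0
    for t in range(1, generations + 1):
        x = rng.normal(mu, sigma, size=n_schedule(t))
        # Plug-in estimates; the sample std is slightly biased downward,
        # which is one source of the downward drift in sigma.
        mu, sigma = x.mean(), x.std(ddof=1)
    return mu, sigma

# Constant sample size: the estimate random-walks and sigma decays.
mu_c, sigma_c = recursive_fit(lambda t: 25)
# Superlinear growth n_t = 25 * t^2: step sizes shrink fast enough for the
# walk to stabilize near the truth.
mu_g, sigma_g = recursive_fit(lambda t: 25 * t * t)

print(f"constant n=25 : mu={mu_c:+.3f}, sigma={sigma_c:.3f}")  # sigma well below 1
print(f"n_t = 25*t^2  : mu={mu_g:+.3f}, sigma={sigma_g:.3f}")  # sigma close to 1
```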
Related papers
- Lost in Retraining: Roaming the Parameter Space of Exponential Families Under Closed-Loop Learning [0.0]
We study closed-loop learning for models that belong to exponential families. We show that maximum likelihood estimation of the parameters endows the sufficient statistics with the martingale property. This outcome may be prevented if the data contain at least one data point generated from a ground-truth model.
arXiv Detail & Related papers (2025-06-25T17:12:22Z)
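The martingale claim above can be checked in the simplest exponential family. In the sketch below (a toy check, not the authors' code; the known-variance Gaussian and n = 50 are illustrative assumptions), the MLE of the mean is the sample mean, so refitting on self-generated data satisfies E[mu_{t+1} | mu_t] = mu_t, even though individual closed-loop trajectories still roam the parameter space.

```python
# Toy check of the martingale property for closed-loop MLE, assuming a
# Gaussian with known variance (an exponential family whose sufficient
# statistic is the sample mean). Not the paper's code.
import numpy as np

rng = np.random.default_rng(1)
n, mu_t = 50, 0.7

# One-step conditional mean: refit many times starting from the same mu_t.
next_mus = rng.normal(mu_t, 1.0, size=(100_000, n)).mean(axis=1)
print(f"E[mu_(t+1) | mu_t = {mu_t}] ~= {next_mus.mean():.4f}")  # ~0.7

# A single closed-loop trajectory: zero drift in expectation, yet it
# wanders like a random walk.
mu = mu_t
for _ in range(2000):
    mu = rng.normal(mu, 1.0, size=n).mean()
print(f"after 2000 closed-loop refits: mu = {mu:+.3f}")
```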
- A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z)
- On conditional diffusion models for PDE simulations [53.01911265639582]
We study score-based diffusion models for forecasting and assimilation of sparse observations.
We propose an autoregressive sampling approach that significantly improves performance in forecasting.
We also propose a new training strategy for conditional score-based models that achieves stable performance over a range of history lengths.
arXiv Detail & Related papers (2024-10-21T18:31:04Z)
- A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs). Namely, we propose novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z)
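As a generic illustration of what a high-probability guarantee on an output distribution can look like (the paper's actual metrics differ; `generate` and `property_holds` below are hypothetical stand-ins for a sampling model and an evaluation predicate), one can estimate the probability of a property from i.i.d. generations and attach a one-sided Hoeffding bound:

```python
# Sample-based estimate of p = Pr[property(model output)] with a
# (1 - delta) upper confidence bound via Hoeffding's inequality.
# `generate` and `property_holds` are hypothetical placeholders.
import math
import random

def estimate_with_guarantee(generate, property_holds, k=10_000, delta=0.01):
    hits = sum(property_holds(generate()) for _ in range(k))
    p_hat = hits / k
    eps = math.sqrt(math.log(1 / delta) / (2 * k))  # one-sided Hoeffding
    return p_hat, min(1.0, p_hat + eps)

# Toy stand-in: a "model" that emits an undesired output 3% of the time.
p_hat, upper = estimate_with_guarantee(
    generate=lambda: "bad" if random.random() < 0.03 else "ok",
    property_holds=lambda s: s == "bad",
)
print(f"p_hat = {p_hat:.4f}, 99% upper bound = {upper:.4f}")
```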
- Towards Learning Stochastic Population Models by Gradient Descent [0.0]
We show that simultaneous estimation of parameters and structure poses major challenges for optimization procedures.
We demonstrate accurate estimation of models but find that enforcing the inference of parsimonious, interpretable models drastically increases the difficulty.
arXiv Detail & Related papers (2024-04-10T14:38:58Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
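A small experiment in the spirit of the kernel density estimation result above (an illustrative setup, not the paper's framework; the N(0, 1) source, sample size of 2000, and 20% real-data mix are arbitrary assumptions) shows how error propagates through a self-consuming KDE loop and how mixing in real data slows it:

```python
# Self-consuming KDE loop, assuming a 1-D N(0, 1) ground truth. Each
# resampling step convolves the data with the kernel, so the variance
# inflates generation after generation; mixing fresh real data back in
# at each step slows this error propagation. Illustrative only.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
real = rng.normal(0.0, 1.0, size=2000)  # ground-truth N(0, 1) data

def kde_loop(real_fraction, generations=20, n=2000):
    data = real
    for _ in range(generations):
        synth = gaussian_kde(data).resample(n, seed=rng)[0]
        k = int(real_fraction * n)  # replace k points with fresh real data
        data = np.concatenate([synth[: n - k], rng.normal(0.0, 1.0, size=k)])
    return data.std()

print(f"pure self-consuming loop: std = {kde_loop(0.0):.3f}")  # drifts above 1
print(f"20% real data each step:  std = {kde_loop(0.2):.3f}")  # stays closer to 1
```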
arXiv Detail & Related papers (2024-02-19T02:08:09Z) - Neural Likelihood Approximation for Integer Valued Time Series Data [0.0]
We construct a neural likelihood approximation that can be trained using unconditional simulation of the underlying model.
We demonstrate our method by performing inference on a number of ecological and epidemiological models.
arXiv Detail & Related papers (2023-10-19T07:51:39Z)
- Last layer state space model for representation learning and uncertainty quantification [0.0]
We propose to decompose a classification or regression task into two steps: a representation learning stage to learn low-dimensional states, and a state space model for uncertainty estimation.
We demonstrate how predictive distributions can be estimated on top of an existing and trained neural network, by adding a state space-based last layer.
Our model accounts for the noisy data structure, due to unknown or unavailable variables, and is able to provide confidence intervals on predictions.
arXiv Detail & Related papers (2023-07-04T08:37:37Z)
- Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning [32.24146877835396]
We propose Plan To Predict (P2P), a framework that treats the model rollout process as a sequential decision making problem.
We show that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
arXiv Detail & Related papers (2023-01-20T10:17:22Z)
- Probabilistic Modeling for Human Mesh Recovery [73.11532990173441]
This paper focuses on the problem of 3D human reconstruction from 2D evidence.
We recast the problem as learning a mapping from the input to a distribution of plausible 3D poses.
arXiv Detail & Related papers (2021-08-26T17:55:11Z)
- Learning Interpretable Deep State Space Model for Probabilistic Time Series Forecasting [98.57851612518758]
Probabilistic time series forecasting involves estimating the distribution of future values based on its history.
We propose a deep state space model for probabilistic time series forecasting whereby the non-linear emission model and transition model are parameterized by neural networks.
We show in experiments that our model produces accurate and sharp probabilistic forecasts.
arXiv Detail & Related papers (2021-01-31T06:49:33Z)