Preventing Model Collapse Under Overparametrization: Optimal Mixing Ratios for Interpolation Learning and Ridge Regression
- URL: http://arxiv.org/abs/2509.22341v1
- Date: Fri, 26 Sep 2025 13:34:48 GMT
- Title: Preventing Model Collapse Under Overparametrization: Optimal Mixing Ratios for Interpolation Learning and Ridge Regression
- Authors: Anvit Garg, Sohom Bhattacharya, Pragya Sur
- Abstract summary: Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We derive precise error formulae for minimum-$\ell_2$-norm interpolation and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the mixing weight that minimizes long-term prediction error.
- Score: 4.71547360356314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We study this effect in overparameterized linear regression in a setting where each iteration mixes fresh real labels with synthetic labels drawn from the model fitted in the previous iteration. We derive precise generalization error formulae for minimum-$\ell_2$-norm interpolation and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the optimal mixing weight that minimizes long-term prediction error and provably prevents model collapse. For instance, in the case of min-$\ell_2$-norm interpolation, we establish that the optimal real-data proportion converges to the reciprocal of the golden ratio for fairly general classes of covariate distributions. Previously, this property was known only for ordinary least squares, and additionally in low dimensions. For ridge regression, we further analyze two popular model classes -- the random-effects model and the spiked covariance model -- demonstrating how spectral geometry governs optimal weighting. In both cases, as well as for isotropic features, we uncover that the optimal mixing ratio should be at least one-half, reflecting the necessity of favoring real-data over synthetic. We validate our theoretical results with extensive simulations.
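To make the iterative scheme concrete, below is a minimal simulation sketch of one plausible reading of it: at each round, fresh real labels are mixed with synthetic labels produced by the previous round's fit (here as a convex combination at the label level), and a minimum-$\ell_2$-norm interpolator is refit. The compared weights include the reciprocal of the golden ratio, $1/\varphi = (\sqrt{5}-1)/2 \approx 0.618$, which the abstract identifies as the limiting optimal real-data proportion. All function names, dimensions, and parameter values are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): iterative retraining in overparameterized
# linear regression, where each round's labels mix fresh real labels with synthetic
# labels from the previous fit. Mixing at the label level is an assumption here.
import numpy as np

n, p, sigma, n_rounds = 100, 400, 0.5, 20   # n < p: overparameterized regime


def min_l2_norm_fit(X, y):
    """Minimum-ell_2-norm interpolator: beta_hat = X^+ y (Moore-Penrose pseudoinverse)."""
    return np.linalg.pinv(X) @ y


def long_run_error(real_weight, seed=0):
    """Run the iterative scheme and return the final estimation error."""
    rng = np.random.default_rng(seed)             # same seed -> same draws for every weight
    beta_star = rng.normal(size=p) / np.sqrt(p)   # true signal
    beta_hat = np.zeros(p)
    for _ in range(n_rounds):
        X = rng.normal(size=(n, p))                          # fresh isotropic covariates
        y_real = X @ beta_star + sigma * rng.normal(size=n)  # fresh real labels
        y_synth = X @ beta_hat                               # synthetic labels from previous fit
        y_mix = real_weight * y_real + (1.0 - real_weight) * y_synth
        beta_hat = min_l2_norm_fit(X, y_mix)
    return float(np.sum((beta_hat - beta_star) ** 2))


golden_recip = 2.0 / (1.0 + np.sqrt(5.0))  # 1/phi = (sqrt(5) - 1)/2 ~= 0.618
for w in (0.3, 0.5, golden_recip, 0.8, 1.0):
    print(f"real-data proportion {w:.3f}: error {long_run_error(w):.3f}")
```

A single seed and a finite number of rounds can only illustrate, not verify, the asymptotic claim; under these assumptions one would expect weights near 0.618 to compare favorably with more extreme choices.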
Related papers
- Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z) - Test Set Sizing for the Ridge Regression [51.56484100374058]
We derive the ideal train/test split for ridge regression to high accuracy in the limit where the number of training rows m becomes large. This is the first time such a split has been calculated mathematically for a machine learning model in the large-data limit.
arXiv Detail & Related papers (2025-04-27T13:17:18Z) - Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification [11.6055501181235]
We investigate the use of verification on synthesized data to prevent model collapse.
We show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse.
arXiv Detail & Related papers (2024-06-11T17:46:16Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
We present a unifying perspective on recent results on ridge regression. We use the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. Our results extend and provide a unifying perspective on earlier models of scaling laws.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - Adaptive Optimization for Prediction with Missing Data [6.800113478497425]
We show that some adaptive linear regression models are equivalent to learning an imputation rule and a downstream linear regression model simultaneously. In settings where data is strongly not missing at random, our methods achieve a 2-10% improvement in out-of-sample accuracy.
arXiv Detail & Related papers (2024-02-02T16:35:51Z) - Augmented balancing weights as linear regression [3.877356414450364]
We provide a novel characterization of augmented balancing weights, also known as automatic debiased machine learning (AutoDML).
We show that the augmented estimator is equivalent to a single linear model with coefficients that combine the coefficients from the original outcome model and coefficients from an unpenalized ordinary least squares (OLS) fit on the same data.
Our framework opens the black box on this increasingly popular class of estimators.
arXiv Detail & Related papers (2023-04-27T21:53:54Z) - Analysis of Interpolating Regression Models and the Double Descent Phenomenon [3.883460584034765]
It is commonly assumed that models which interpolate noisy training data generalize poorly. The best models obtained are overparametrized, and the testing error exhibits double descent behavior as the model order increases.
We derive a result based on the behavior of the smallest singular value of the regression matrix that explains the peak location and the double descent shape of the testing error as a function of model order.
arXiv Detail & Related papers (2023-04-17T09:44:33Z) - Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression [14.493176427999028]
We study benign overfitting theory in the prediction of the conditional average treatment effect (CATE) with linear regression models.
We show that the T-learner fails to achieve consistency except under random assignment, while the IPW-learner's risk converges to zero if the propensity score is known.
arXiv Detail & Related papers (2022-02-10T18:51:52Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Convex Latent Effect Logit Model via Sparse and Low-rank Decomposition [2.1915057426589746]
We propose a convex formulation for learning a logistic regression model (logit) with latent heterogeneous effects on sub-populations.
Despite its popularity, the mixed logit approach for learning individual heterogeneity has several downsides.
arXiv Detail & Related papers (2021-08-22T22:23:39Z) - Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose a finite mixture regression (FMR) model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z) - Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.