Self-Correcting Self-Consuming Loops for Generative Model Training
- URL: http://arxiv.org/abs/2402.07087v3
- Date: Mon, 10 Jun 2024 14:22:45 GMT
- Title: Self-Correcting Self-Consuming Loops for Generative Model Training
- Authors: Nate Gillman, Michael Freeman, Daksh Aggarwal, Chia-Hong Hsu, Calvin Luo, Yonglong Tian, Chen Sun
- Abstract summary: Machine learning models are increasingly trained on a mix of human- and machine-generated data.
Despite the success stories of using synthetic data for representation learning, using synthetic data for generative model training creates "self-consuming loops" that can lead to training instability or even collapse.
Our paper aims to stabilize self-consuming generative model training by introducing an idealized correction function.
- Score: 16.59453827606427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As synthetic data becomes higher quality and proliferates on the internet, machine learning models are increasingly trained on a mix of human- and machine-generated data. Despite the success stories of using synthetic data for representation learning, using synthetic data for generative model training creates "self-consuming loops" which may lead to training instability or even collapse, unless certain conditions are met. Our paper aims to stabilize self-consuming generative model training. Our theoretical results demonstrate that by introducing an idealized correction function, which maps a data point to be more likely under the true data distribution, self-consuming loops can be made exponentially more stable. We then propose self-correction functions, which rely on expert knowledge (e.g. the laws of physics programmed in a simulator), and aim to approximate the idealized corrector automatically and at scale. We empirically validate the effectiveness of self-correcting self-consuming loops on the challenging human motion synthesis task, and observe that it successfully avoids model collapse, even when the ratio of synthetic data to real data is as high as 100%.
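To make the loop concrete, here is a minimal sketch of a self-correcting self-consuming loop on a toy 1-D Gaussian. The `train`, `sample`, and `correct` functions are illustrative stand-ins, not the paper's implementation (the actual corrector for human motion is a physics simulator, not a closed-form score step):

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MU, TRUE_SIGMA = 0.0, 1.0                     # ground-truth data distribution
real_data = rng.normal(TRUE_MU, TRUE_SIGMA, 1_000)

def train(data):
    """'Generative model' here is just a fitted Gaussian (mu, sigma)."""
    return data.mean(), data.std()

def sample(model, n):
    mu, sigma = model
    return rng.normal(mu, sigma, n)

def correct(x, strength=0.5):
    """Idealized corrector: move each point toward higher likelihood under
    the true distribution (a gradient step on the true log-density)."""
    return x + strength * (TRUE_MU - x) / TRUE_SIGMA**2

model = train(real_data)
for gen in range(10):
    synthetic = correct(sample(model, 1_000))      # synthetic-to-real ratio of 100%
    model = train(np.concatenate([real_data, synthetic]))
    print(f"gen {gen}: mu={model[0]:+.3f}, sigma={model[1]:.3f}")
```

The corrector here does exactly what the theory assumes, mapping a sample to be more likely under the true distribution; in practice that oracle is approximated with expert knowledge such as a simulator.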
Related papers
- Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory [8.713796223707398]
We use random matrix theory to derive the performance of a binary classifier trained on a mix of real and synthetic data.
Our findings identify conditions where synthetic data could improve performance, focusing on the quality of the generative model and verification strategy.
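A minimal sketch of this setting (not the paper's random-matrix analysis): a binary classifier trained on a real/synthetic mix, where the generator's quality and a crude verification rule both shape test accuracy. All names and constants below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_real(n):
    """Two-class Gaussian data: class 0 centered at -1, class 1 at +1."""
    y = rng.integers(0, 2, n)
    return rng.normal(2 * y - 1, 1.0, n).reshape(-1, 1), y

def make_synthetic(n, quality=0.6):
    """Imperfect generator: class means pulled toward 0 as quality drops."""
    y = rng.integers(0, 2, n)
    return rng.normal(quality * (2 * y - 1), 1.0, n).reshape(-1, 1), y

X_test, y_test = make_real(5_000)
for frac in (0.0, 0.5, 0.9):
    n_syn = int(1_000 * frac)
    Xr, yr = make_real(1_000 - n_syn)
    Xs, ys = make_synthetic(n_syn)
    keep = np.abs(Xs[:, 0] - (2 * ys - 1)) < 2.0   # crude verification step
    X, y = np.vstack([Xr, Xs[keep]]), np.concatenate([yr, ys[keep]])
    clf = LogisticRegression().fit(X, y)
    print(f"synthetic fraction {frac:.1f}: acc={clf.score(X_test, y_test):.3f}")
```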
arXiv Detail & Related papers (2024-10-11T16:09:27Z)
- Self-Improving Diffusion Models with Synthetic Data [12.597035060380001]
Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models.
SIMS uses self-synthesized data to provide negative guidance during the generation process.
It is the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD (suffering model autophagy disorder).
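A hedged sketch of the negative-guidance idea on a toy score model: the sampler is steered away from a self-trained model's drifted distribution. The Gaussian scores, the extrapolation form, and the weight `w` are assumptions for illustration, not the SIMS implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_score(x, mu, sigma):
    """Score (gradient of the log-density) of a Gaussian."""
    return (mu - x) / sigma**2

base = (0.0, 1.0)          # model trained on real data
self_trained = (0.3, 0.7)  # hypothetical copy retrained on its own output: drifted, narrower

def guided_score(x, w=0.5):
    """Negative guidance, in spirit: extrapolate away from the self-trained model."""
    s_base = gaussian_score(x, *base)
    s_self = gaussian_score(x, *self_trained)
    return s_base + w * (s_base - s_self)

# Unadjusted Langevin sampling with the guided score
x = rng.normal(0.0, 1.0, 10_000)
for _ in range(2_000):
    x += 0.01 * guided_score(x) + np.sqrt(0.02) * rng.normal(size=x.size)
print(f"guided samples: mean={x.mean():+.3f}, std={x.std():.3f}")
```

The guided samples are pushed opposite the self-trained model's drift, which is the qualitative effect the summary describes.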
arXiv Detail & Related papers (2024-08-29T08:12:18Z)
- Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences [20.629333587044012]
We study the impact of data curation on iterated retraining of generative models.
We prove that, if the data is curated according to a reward model, the expected reward of the iterative retraining procedure is maximized.
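An illustrative toy of reward-based curation via best-of-n selection; the `reward` function and the Gaussian "generator" below are hypothetical, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(3)

def reward(x):
    """Hypothetical reward model: prefers samples near +1."""
    return -(x - 1.0) ** 2

mu, sigma = 0.0, 1.0                              # toy generator
for gen in range(5):
    candidates = rng.normal(mu, sigma, (2_000, 4))
    # Curation: keep the best-of-4 candidate per draw, as a user might
    curated = candidates[np.arange(2_000), reward(candidates).argmax(axis=1)]
    mu, sigma = curated.mean(), curated.std()     # retrain on curated data
    print(f"gen {gen}: mu={mu:+.3f} (drifting toward the reward optimum)")
```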
arXiv Detail & Related papers (2024-06-12T21:28:28Z)
- Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification [11.6055501181235]
We investigate the use of verification on synthesized data to prevent model collapse.
We show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse.
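A minimal sketch of how a verifier can stabilize a fully self-consuming loop. The verifier below is an oracle for the toy, accepting samples in proportion to the true-to-model density ratio (i.e. rejection sampling toward the true distribution); the paper's point is that even imperfect verifiers help:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

mu, sigma = 0.2, 0.8                              # a slightly-off generator to start
for gen in range(10):
    x = rng.normal(mu, sigma, 5_000)
    # Oracle-ish verifier: accept with probability proportional to
    # p_true(x) / p_model(x), pulling samples back toward N(0, 1)
    ratio = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, mu, sigma)
    verified = x[rng.random(x.size) < ratio / ratio.max()]
    mu, sigma = verified.mean(), verified.std()   # retrain only on verified data
    print(f"gen {gen}: mu={mu:+.3f}, sigma={sigma:.3f}")
```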
arXiv Detail & Related papers (2024-06-11T17:46:16Z)
- Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data [49.73114504515852]
We show that replacing the original real data with each generation's synthetic data does indeed tend towards model collapse.
We demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse.
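The accumulate-versus-replace contrast is easy to reproduce in a Gaussian toy; the sample size and generation count below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
real = rng.normal(0.0, 1.0, 100)

for mode in ("replace", "accumulate"):
    data = real.copy()
    for _ in range(1_000):
        mu, sigma = data.mean(), data.std()
        synthetic = rng.normal(mu, sigma, 100)
        # "replace" discards everything seen so far; "accumulate" keeps it all
        data = synthetic if mode == "replace" else np.concatenate([data, synthetic])
    print(f"{mode:>10}: fitted sigma after 1000 generations = {data.std():.4f}")
```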
arXiv Detail & Related papers (2024-04-01T18:31:24Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
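A small sketch of the kernel density estimation setting, assuming SciPy's `gaussian_kde`; mixing each generation's samples back with the real data is the mixed-data regime the paper analyzes:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(6)
real = rng.normal(0.0, 1.0, 500)

data = real
for gen in range(5):
    kde = gaussian_kde(data)                      # fit a KDE to the current data
    synthetic = kde.resample(500)[0]              # draw the next generation
    data = np.concatenate([real, synthetic])      # mixed real + synthetic training
    # KDE resampling adds bandwidth noise each round; mixing in the fixed
    # real data damps how far that error can propagate
    print(f"gen {gen}: sample std = {data.std():.3f}")
```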
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to perform well only on similar data, while underperforming on real-world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
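A toy version of the stability claim, assuming the share of real data re-injected per round is the knob that matters; constants are illustrative, and a fitted Gaussian stands in for a normalizing flow or diffusion model:

```python
import numpy as np

rng = np.random.default_rng(7)
real = rng.normal(0.0, 1.0, 1_000)

for n_real in (0, 20, 100):                       # real points re-injected per round
    mu, sigma = real.mean(), real.std()           # well-initialized model
    for _ in range(500):
        batch = np.concatenate([
            real[:n_real],                        # fixed share of real data
            rng.normal(mu, sigma, 200 - n_real),  # the rest is self-generated
        ])
        mu, sigma = batch.mean(), batch.std()
    print(f"{n_real:3d} real per round -> sigma after 500 rounds: {sigma:.3f}")
```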
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
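A hedged sketch of the ensemble idea: the actual DGE trains K deep generative models, whereas here K bootstrap-fitted Gaussians stand in for posterior draws over the generative-process parameters:

```python
import numpy as np

rng = np.random.default_rng(8)
real = rng.normal(0.0, 1.0, 200)

# Toy ensemble: each member is fit on a bootstrap resample of the data,
# standing in for one posterior draw over model parameters
K = 10
models = []
for _ in range(K):
    boot = rng.choice(real, size=real.size, replace=True)
    models.append((boot.mean(), boot.std()))

# Synthetic data drawn from the ensemble mixes the K models' views,
# propagating generative-process uncertainty into the downstream task
which = rng.integers(0, K, 1_000)
synthetic = np.array([rng.normal(*models[k]) for k in which])
member_mus = [mu for mu, _ in models]
print(f"ensemble spread over mu: {np.std(member_mus):.3f}; synthetic std: {synthetic.std():.3f}")
```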
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Real-to-Sim: Predicting Residual Errors of Robotic Systems with Sparse Data using a Learning-based Unscented Kalman Filter [65.93205328894608]
We learn the residual errors between a dynamics and/or simulator model and the real robot.
We show that with the learned residual errors, we can further close the reality gap between dynamic models, simulations, and actual hardware.
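A minimal sketch of residual-error learning, with a Gaussian process regressor standing in for the paper's learning-based unscented Kalman filter; the dynamics below are invented for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(9)

def simulator(state, action):
    """Nominal dynamics model (invented for illustration)."""
    return state + 0.1 * action

def real_robot(state, action):
    """'Real' system: the simulator plus an unmodeled drag term and noise."""
    return simulator(state, action) - 0.05 * state**2 + rng.normal(0.0, 0.01, np.shape(state))

# Sparse data: 30 (state, action) -> residual pairs
states, actions = rng.uniform(-1, 1, 30), rng.uniform(-1, 1, 30)
residuals = real_robot(states, actions) - simulator(states, actions)
X = np.column_stack([states, actions])

gp = GaussianProcessRegressor().fit(X, residuals)  # stand-in for the learned residual term

# Corrected prediction = nominal simulator + learned residual
s, a = 0.5, 0.2
corrected = simulator(s, a) + gp.predict([[s, a]])[0]
truth = simulator(s, a) - 0.05 * s**2              # noiseless ground truth
print(f"sim: {simulator(s, a):.4f}  corrected: {corrected:.4f}  truth: {truth:.4f}")
```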
arXiv Detail & Related papers (2022-09-07T15:15:12Z)
- Synthesizing Irreproducibility in Deep Networks [2.28438857884398]
Modern-day deep networks suffer from irreproducibility (also referred to as nondeterminism or underspecification).
We show that even with a single nonlinearity and for very simple data and models, irreproducibility occurs.
Model complexity and the choice of nonlinearity also play significant roles in making deep models irreproducible.
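A quick way to see this: train the same small ReLU network twice, changing only the seed, and measure prediction disagreement. This is a toy mirror of the paper's setup, not its exact models:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(10)
X = rng.normal(size=(2_000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0).astype(int)  # simple nonlinear task
X_test = rng.normal(size=(10_000, 10))

# Identical architecture and data; only the init/shuffling seed differs
preds = []
for seed in (0, 1):
    clf = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                        max_iter=300, random_state=seed).fit(X, y)
    preds.append(clf.predict(X_test))

disagreement = np.mean(preds[0] != preds[1])
print(f"prediction disagreement between the two runs: {disagreement:.1%}")
```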
arXiv Detail & Related papers (2021-02-21T21:51:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.