Towards Theoretical Understandings of Self-Consuming Generative Models
- URL: http://arxiv.org/abs/2402.11778v2
- Date: Mon, 24 Jun 2024 14:23:30 GMT
- Title: Towards Theoretical Understandings of Self-Consuming Generative Models
- Authors: Shi Fu, Sen Zhang, Yingjie Wang, Xinmei Tian, Dacheng Tao
- Abstract summary: This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
- Score: 56.84592466204185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles the emerging challenge of training generative models within a self-consuming loop, wherein successive generations of models are recursively trained on mixtures of real and synthetic data from previous generations. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models, including parametric and non-parametric models. Specifically, we derive bounds on the total variation (TV) distance between the synthetic data distributions produced by future models and the original real data distribution under various mixed training scenarios for diffusion models with a one-hidden-layer neural network score function. Our analysis demonstrates that this distance can be effectively controlled under the condition that mixed training dataset sizes or proportions of real data are large enough. Interestingly, we further unveil a phase transition induced by expanding synthetic data amounts, proving theoretically that while the TV distance exhibits an initial ascent, it declines beyond a threshold point. Finally, we present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
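To make the self-consuming loop concrete, here is a minimal simulation sketch; it is our own illustration rather than the paper's construction. A kernel density estimator plays the role of the generative model, each generation is refit on a mixture of real and synthetic samples from the previous generation, and a grid-based estimate tracks the TV distance to the real distribution. The N(0, 1) real distribution, the mixing proportion `real_frac`, and the grid resolution are illustrative assumptions.

```python
# Minimal self-consuming-loop sketch (illustrative assumptions throughout):
# a Gaussian KDE is refit each generation on a real/synthetic mixture, and a
# grid-based Riemann sum approximates the TV distance to the real density.
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
real_data = rng.normal(0.0, 1.0, size=2000)   # "real" distribution: N(0, 1)
grid = np.linspace(-8.0, 8.0, 2001)
real_pdf = norm.pdf(grid)

def tv_on_grid(pdf_a, pdf_b):
    """Approximate TV distance between two densities evaluated on `grid`."""
    return 0.5 * np.sum(np.abs(pdf_a - pdf_b)) * (grid[1] - grid[0])

def self_consuming_loop(generations=10, n_per_gen=2000, real_frac=0.5):
    model = gaussian_kde(real_data)            # generation 0: fit on real data only
    tv_history = []
    for _ in range(generations):
        synthetic = model.resample(n_per_gen)[0]
        n_real = int(real_frac * n_per_gen)
        mixed = np.concatenate([
            rng.choice(real_data, size=n_real, replace=False),
            rng.choice(synthetic, size=n_per_gen - n_real, replace=False),
        ])
        model = gaussian_kde(mixed)            # next generation: fit on the mixture
        tv_history.append(tv_on_grid(model(grid), real_pdf))
    return tv_history

print(self_consuming_loop(real_frac=0.5))      # enough real data: TV stays controlled
print(self_consuming_loop(real_frac=0.0))      # fully self-consuming: error accumulates
```

Varying `real_frac` and `n_per_gen` in this toy loop mirrors the regimes studied in the paper: larger mixed datasets or a larger share of real data keep the TV distance bounded, while a purely self-consuming loop lets estimation error compound across generations.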
Related papers
- Constrained Diffusion Models via Dual Training [80.03953599062365]
We develop constrained diffusion models based on desired distributions informed by requirements.
We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off between the objective and the constraints.
arXiv Detail & Related papers (2024-08-27T14:25:42Z) - Generating Synthetic Net Load Data with Physics-informed Diffusion Model [0.8848340429852071]
A conditional denoising neural network is designed to jointly learn the parameters of the transition kernel of the diffusion model.
A comprehensive set of evaluation metrics is used to assess the accuracy and diversity of the generated synthetic net load data.
arXiv Detail & Related papers (2024-06-04T02:50:19Z) - Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) approximates the posterior distribution over input parameters given an observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z) - How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse [9.59833542807268]
Model collapse occurs when new models are trained on synthetic data generated from previously trained models.
We show that model collapse cannot be avoided when training solely on synthetic data.
We estimate a maximal amount of synthetic data below which model collapse can eventually be avoided.
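As a hypothetical illustration of the collapse mechanism described above (our own toy example, not the referenced statistical analysis), the sketch below recursively fits a Gaussian to samples drawn from the previous generation's fit. With purely synthetic data the fitted scale tends to drift toward zero over many generations, whereas re-injecting real samples each round anchors the estimate.

```python
# Toy model-collapse illustration (an assumption, not the referenced analysis):
# each generation refits (mean, std) on samples from the previous generation's
# Gaussian, optionally mixed with a fixed pool of real samples.
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=500)

def recursive_fit(generations=200, n_synth=100, real_per_gen=0):
    mu, sigma = real.mean(), real.std()
    for _ in range(generations):
        synthetic = rng.normal(mu, sigma, size=n_synth)
        train = np.concatenate([synthetic, real[:real_per_gen]])
        mu, sigma = train.mean(), train.std()
    return mu, sigma

print(recursive_fit(real_per_gen=0))     # synthetic-only: std typically decays toward 0
print(recursive_fit(real_per_gen=100))   # mixed training: estimate stays near (0, 1)
```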
arXiv Detail & Related papers (2024-04-07T22:15:13Z) - On the Limitation of Diffusion Models for Synthesizing Training Datasets [5.384630221560811]
This paper investigates the gap between synthetic and real samples by analyzing synthetic samples reconstructed from real samples via the forward diffusion and reverse processes.
We found that the synthetic datasets degrade classification performance over real datasets even when using state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-11-22T01:42:23Z) - On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z) - MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
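A minimal sketch of an observed-entries-only objective in this spirit (our own simplification, not the MissDiff implementation): a noise-prediction loss is averaged only over observed cells, so the placeholder values used to fill missing entries never contribute to the training signal, avoiding the bias of a naive impute-then-generate pipeline.

```python
# Masked denoising objective sketch for tabular data with missing values
# (illustrative simplification; the mask-restricted loss is the key idea).
import numpy as np

def masked_denoising_loss(predict_noise, x, obs_mask, sigma=0.5, rng=None):
    """Noise-prediction loss restricted to observed entries.

    predict_noise: callable mapping a noisy table to a noise estimate
    x:             (n, d) data matrix with NaN at missing entries
    obs_mask:      (n, d) boolean mask, True where x is observed
    """
    if rng is None:
        rng = np.random.default_rng()
    x_filled = np.where(obs_mask, x, 0.0)      # placeholders never enter the loss
    noise = rng.normal(size=x.shape)
    x_noisy = x_filled + sigma * noise
    residual = predict_noise(x_noisy) - noise
    return np.mean(residual[obs_mask] ** 2)    # average only over observed cells

# Usage with a trivial "model" that predicts zero noise:
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
x[rng.random(x.shape) < 0.3] = np.nan
mask = ~np.isnan(x)
print(masked_denoising_loss(lambda z: np.zeros_like(z), x, mask, rng=rng))
```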
arXiv Detail & Related papers (2023-07-02T03:49:47Z) - Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace.
We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated.
The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z) - Mixed Effects Neural ODE: A Variational Approximation for Analyzing the Dynamics of Panel Data [50.23363975709122]
We propose a probabilistic model called ME-NODE to incorporate (fixed + random) mixed effects for analyzing panel data.
We show that our model can be derived using smooth approximations of SDEs provided by the Wong-Zakai theorem.
We then derive Evidence Lower Bounds (ELBOs) for ME-NODE and develop efficient training algorithms.
arXiv Detail & Related papers (2022-02-18T22:41:51Z)