On the Limitation of Diffusion Models for Synthesizing Training Datasets
- URL: http://arxiv.org/abs/2311.13090v1
- Date: Wed, 22 Nov 2023 01:42:23 GMT
- Title: On the Limitation of Diffusion Models for Synthesizing Training Datasets
- Authors: Shin'ya Yamaguchi and Takuma Fukuda
- Abstract summary: This paper investigates the gap between synthetic and real samples by analyzing the synthetic samples reconstructed from real samples through the diffusion and reverse processes.
We found that the synthetic datasets degrade classification performance relative to real datasets even when using state-of-the-art diffusion models.
- Score: 5.384630221560811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic samples from diffusion models are promising as replications of
real training datasets for training discriminative models. However, we found that
the synthetic datasets degrade classification performance relative to real datasets
even when using state-of-the-art diffusion models. This means that modern diffusion
models do not perfectly represent the data distribution for the purpose of
replicating datasets for training discriminative models. This paper investigates
the gap between synthetic and real samples by analyzing the synthetic samples
reconstructed from real samples through the diffusion and reverse processes. By
varying the time step at which the reverse process starts during reconstruction, we
can control the trade-off between the information retained from the original real
data and the information added by the diffusion model. By assessing the
reconstructed samples and the models trained on them, we found that the synthetic
data concentrate in the modes of the training data distribution as the reverse step
increases, and thus they struggle to cover the outer edges of the distribution. Our
findings imply that modern diffusion models are insufficient to replicate the
training data distribution perfectly, and that there is room to improve generative
modeling for the replication of training datasets.
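As a rough illustration of the reconstruction analysis described in the abstract, the sketch below diffuses a real image forward to an intermediate time step and then runs the reverse process from that step, so that a larger starting step keeps less of the real sample and relies more on information added by the diffusion model. This is a minimal sketch using the Hugging Face diffusers API, not the paper's code; the pretrained model name (google/ddpm-cifar10-32) and the helper reconstruct are illustrative assumptions.

```python
# Hypothetical sketch of diffuse-then-reconstruct with a pretrained DDPM.
import torch
from diffusers import DDPMPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DDPMPipeline.from_pretrained("google/ddpm-cifar10-32").to(device)
unet, scheduler = pipe.unet, pipe.scheduler

@torch.no_grad()
def reconstruct(x0: torch.Tensor, t_start: int) -> torch.Tensor:
    """Diffuse real images x0 (scaled to [-1, 1]) forward to t_start,
    then denoise them back to step 0 with the reverse process."""
    noise = torch.randn_like(x0)
    t = torch.full((x0.shape[0],), t_start, device=device, dtype=torch.long)
    x_t = scheduler.add_noise(x0, noise, t)        # forward (diffusion) process
    for step in range(t_start, -1, -1):            # reverse (denoising) process
        eps = unet(x_t, step).sample               # predicted noise at this step
        x_t = scheduler.step(eps, step, x_t).prev_sample
    return x_t
```

Sweeping t_start from small to large reproduces the trade-off discussed above: small values keep reconstructions close to the real samples, while large values yield samples that increasingly reflect the model's learned modes.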
Related papers
- Constrained Diffusion Models via Dual Training [80.03953599062365]
We develop constrained diffusion models based on desired distributions informed by requirements.
We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off between the objective and constraints.
arXiv Detail & Related papers (2024-08-27T14:25:42Z)
- Provable Statistical Rates for Consistency Diffusion Models [87.28777947976573]
Despite the state-of-the-art performance, diffusion models are known for their slow sample generation due to the extensive number of steps involved.
This paper contributes towards the first statistical theory for consistency models, formulating their training as a distribution discrepancy minimization problem.
arXiv Detail & Related papers (2024-06-23T20:34:18Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Training Class-Imbalanced Diffusion Model Via Overlap Optimization [55.96820607533968]
Diffusion models trained on real-world datasets often yield inferior fidelity for tail classes.
Deep generative models, including diffusion models, are biased towards classes with abundant training images.
We propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
arXiv Detail & Related papers (2024-02-16T16:47:21Z)
- Lecture Notes in Probabilistic Diffusion Models [0.5361320134021585]
Diffusion models are loosely based on non-equilibrium thermodynamics.
The diffusion model learns the data manifold to which the original and thus the reconstructed data samples belong.
Diffusion models have -- unlike variational autoencoders and flow models -- latent variables with the same dimensionality as the original data.
arXiv Detail & Related papers (2023-12-16T09:36:54Z)
- Private Synthetic Data Meets Ensemble Learning [15.425653946755025]
When machine learning models are trained on synthetic data and then deployed on real data, there is often a performance drop.
We introduce a new ensemble strategy for training downstream models, with the goal of enhancing their performance when used on real data.
arXiv Detail & Related papers (2023-10-15T04:24:42Z)
- Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task [20.749514363389878]
We study compositional generalization in conditional diffusion models in a synthetic setting.
We find that the order in which the ability to generate samples emerges is governed by the structure of the underlying data-generating process.
Our study lays a foundation for understanding capabilities and compositionality in generative models from a data-centric perspective.
arXiv Detail & Related papers (2023-10-13T18:00:59Z)
- On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
- Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace.
We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated.
The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z)