Related papers: On the Limitation of Diffusion Models for Synthesizing Training Datasets

On the Limitation of Diffusion Models for Synthesizing Training Datasets

URL: http://arxiv.org/abs/2311.13090v1
Date: Wed, 22 Nov 2023 01:42:23 GMT
Title: On the Limitation of Diffusion Models for Synthesizing Training Datasets
Authors: Shin'ya Yamaguchi and Takuma Fukuda
Abstract summary: This paper investigates the gap between synthetic and real samples by analyzing the synthetic samples reconstructed from real samples through the diffusion and reverse process. We found that the synthetic datasets degrade classification performance over real datasets even when using state-of-the-art diffusion models.
Score: 5.384630221560811
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Synthetic samples from diffusion models are promising for leveraging in training discriminative models as replications of real training datasets. However, we found that the synthetic datasets degrade classification performance over real datasets even when using state-of-the-art diffusion models. This means that modern diffusion models do not perfectly represent the data distribution for the purpose of replicating datasets for training discriminative tasks. This paper investigates the gap between synthetic and real samples by analyzing the synthetic samples reconstructed from real samples through the diffusion and reverse process. By varying the time steps starting the reverse process in the reconstruction, we can control the trade-off between the information in the original real data and the information added by diffusion models. Through assessing the reconstructed samples and trained models, we found that the synthetic data are concentrated in modes of the training data distribution as the reverse step increases, and thus, they are difficult to cover the outer edges of the distribution. Our findings imply that modern diffusion models are insufficient to replicate training data distribution perfectly, and there is room for the improvement of generative modeling in the replication of training datasets.

Related papers

Multimodal Atmospheric Super-Resolution With Deep Generative Models [0.0]
Score-based diffusion modeling is a generative machine learning algorithm that can be used to sample from complex distributions.<n>In this article, we apply such a concept to the super-resolution of a high-dimensional dynamical system, given the real-time availability of low-resolution and experimentally observed sparse sensor measurements.
arXiv Detail & Related papers (2025-06-28T06:47:09Z)
SeisRDT: Latent Diffusion Model Based On Representation Learning For Seismic Data Interpolation And Reconstruction [11.530476559185878]
Due to limitations such as geographic, physical, or economic factors, collected seismic data often have missing traces. Traditional seismic data reconstruction methods face the challenge of selecting numerous empirical parameters and struggle to handle large-scale continuous missing traces. We propose a latent diffusion transformer utilizing representation learning for seismic data reconstruction.
arXiv Detail & Related papers (2025-03-17T10:16:35Z)
Constrained Diffusion Models via Dual Training [80.03953599062365]
Diffusion processes are prone to generating samples that reflect biases in a training dataset. We develop constrained diffusion models by imposing diffusion constraints based on desired distributions. We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off among objective and constraints.
arXiv Detail & Related papers (2024-08-27T14:25:42Z)
Provable Statistical Rates for Consistency Diffusion Models [87.28777947976573]
Despite the state-of-the-art performance, diffusion models are known for their slow sample generation due to the extensive number of steps involved. This paper contributes towards the first statistical theory for consistency models, formulating their training as a distribution discrepancy minimization problem.
arXiv Detail & Related papers (2024-06-23T20:34:18Z)
Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models. We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
Training Class-Imbalanced Diffusion Model Via Overlap Optimization [55.96820607533968]
Diffusion models trained on real-world datasets often yield inferior fidelity for tail classes. Deep generative models, including diffusion models, are biased towards classes with abundant training images. We propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
arXiv Detail & Related papers (2024-02-16T16:47:21Z)
Lecture Notes in Probabilistic Diffusion Models [0.5361320134021585]
Diffusion models are loosely modelled based on non-equilibrium thermodynamics. The diffusion model learns the data manifold to which the original and thus the reconstructed data samples belong. Diffusion models have -- unlike variational autoencoder and flow models -- latent variables with the same dimensionality as the original data.
arXiv Detail & Related papers (2023-12-16T09:36:54Z)
Private Synthetic Data Meets Ensemble Learning [15.425653946755025]
When machine learning models are trained on synthetic data and then deployed on real data, there is often a performance drop. We introduce a new ensemble strategy for training downstream models, with the goal of enhancing their performance when used on real data.
arXiv Detail & Related papers (2023-10-15T04:24:42Z)
Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task [20.749514363389878]
We study compositional generalization in conditional diffusion models in a synthetic setting. We find that the order in which the ability to generate samples emerges is governed by the structure of the underlying data-generating process. Our study lays a foundation for understanding capabilities and compositionality in generative models from a data-centric perspective.
arXiv Detail & Related papers (2023-10-13T18:00:59Z)
On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace. We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated. The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.