Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets
- URL: http://arxiv.org/abs/2502.05446v1
- Date: Sat, 08 Feb 2025 04:37:39 GMT
- Title: Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets
- Authors: Haoye Lu, Qifan Wu, Yaoliang Yu,
- Abstract summary: Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement.
A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content.
We show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable.
- Score: 20.11176801612364
- License:
- Abstract: Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward--Backward Deconvolution (SFBD) method, we attain an FID of $6.31$ on CIFAR-10 with just $4\%$ clean images (and $3.58$ with $10\%$). Theoretically, we prove that SFBD guides the model to learn the true data distribution. The result also highlights the importance of pretraining on limited but clean data or the alternative from similar datasets. Empirical studies further support these findings and offer additional insights.
Related papers
- Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty [9.749638953163391]
We introduce a Difficulty and Uncertainty-Aware Lightweight (DUAL) score to identify important samples from the early training stage.
We also propose a ratio-adaptive sampling using Beta distribution to address a catastrophic accuracy drop at an extreme pruning.
arXiv Detail & Related papers (2025-02-10T01:18:40Z) - Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI)
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
Based on the theoretical analysis, we propose a novel data extraction method called textbfSurrogate condItional Data Extraction (SIDE) that leverages a trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z) - Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z) - Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data [74.2507346810066]
Ambient diffusion is a recently proposed framework for training diffusion models using corrupted data.
We present the first framework for training diffusion models that provably sample from the uncorrupted distribution given only noisy training data.
arXiv Detail & Related papers (2024-03-20T14:22:12Z) - MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z) - Ambient Diffusion: Learning Clean Distributions from Corrupted Data [77.34772355241901]
We present the first diffusion-based framework that can learn an unknown distribution using only highly-corrupted samples.
Another benefit of our approach is the ability to train generative models that are less likely to memorize individual training samples.
arXiv Detail & Related papers (2023-05-30T17:43:33Z) - GSURE-Based Diffusion Model Training with Corrupted Data [35.56267114494076]
We propose a novel training technique for generative diffusion models based only on corrupted data.
We demonstrate our technique on face images as well as Magnetic Resonance Imaging (MRI)
arXiv Detail & Related papers (2023-05-22T15:27:20Z) - Dataset Distillation: A Comprehensive Review [76.26276286545284]
dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z) - Fair Generative Models via Transfer Learning [39.12323728810492]
We propose fairTL, a transfer learning approach to learn fair generative models.
We introduce two additional innovations to improve upon fairTL: (i) multiple feedback and (ii) Linear-Probing followed by Fine-Tuning.
Extensive experiments show that fairTL and fairTL++ achieve state-of-the-art in both quality and fairness of generated samples.
arXiv Detail & Related papers (2022-12-02T01:44:38Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.