Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets
- URL: http://arxiv.org/abs/2502.05446v1
- Date: Sat, 08 Feb 2025 04:37:39 GMT
- Title: Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets
- Authors: Haoye Lu, Qifan Wu, Yaoliang Yu,
- Abstract summary: Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement.<n>A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content.<n>We show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable.
- Score: 20.11176801612364
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward--Backward Deconvolution (SFBD) method, we attain an FID of $6.31$ on CIFAR-10 with just $4\%$ clean images (and $3.58$ with $10\%$). Theoretically, we prove that SFBD guides the model to learn the true data distribution. The result also highlights the importance of pretraining on limited but clean data or the alternative from similar datasets. Empirical studies further support these findings and offer additional insights.
Related papers
- Data Unlearning in Diffusion Models [44.99833362998488]
General-purpose machine unlearning techniques were found to be either unstable or failed to unlearn data.
We propose a family of new loss functions called Subtracted Importance Sampled Scores (SISS) that utilize importance sampling and are the first method to unlearn data with theoretical guarantees.
arXiv Detail & Related papers (2025-03-02T21:36:04Z) - Redistribute Ensemble Training for Mitigating Memorization in Diffusion Models [31.92526915009259]
Diffusion models are known for their tremendous ability to generate high-quality samples.<n>Recent methods for memory mitigation have primarily addressed the issue within the context of the text modality.<n>We propose a novel method for diffusion models from the perspective of visual modality, which is more generic and fundamental for mitigating memorization.
arXiv Detail & Related papers (2025-02-13T15:56:44Z) - Understanding Fine-tuning in Approximate Unlearning: A Theoretical Perspective [39.958103832214135]
Fine-tuning (FT) methods have become one of the fundamental approaches for approximating unlearning.<n>We present the first theoretical analysis of FT methods for machine unlearning within a linear regression framework.<n>We propose a novel Retention-Based Masking (RBM) strategy that constructs a weight saliency map based on the remaining dataset.
arXiv Detail & Related papers (2024-10-04T18:01:52Z) - Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI)
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
Based on the theoretical analysis, we propose a novel data extraction method called textbfSurrogate condItional Data Extraction (SIDE) that leverages a trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z) - Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z) - Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data [74.2507346810066]
Ambient diffusion is a recently proposed framework for training diffusion models using corrupted data.
We present the first framework for training diffusion models that provably sample from the uncorrupted distribution given only noisy training data.
arXiv Detail & Related papers (2024-03-20T14:22:12Z) - MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z) - BOOT: Data-free Distillation of Denoising Diffusion Models with
Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT, that overcomes limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z) - Ambient Diffusion: Learning Clean Distributions from Corrupted Data [77.34772355241901]
We present the first diffusion-based framework that can learn an unknown distribution using only highly-corrupted samples.
Another benefit of our approach is the ability to train generative models that are less likely to memorize individual training samples.
arXiv Detail & Related papers (2023-05-30T17:43:33Z) - GSURE-Based Diffusion Model Training with Corrupted Data [35.56267114494076]
We propose a novel training technique for generative diffusion models based only on corrupted data.
We demonstrate our technique on face images as well as Magnetic Resonance Imaging (MRI)
arXiv Detail & Related papers (2023-05-22T15:27:20Z) - Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be
Consistent [97.64313409741614]
We propose to enforce a emphconsistency property which states that predictions of the model on its own generated data are consistent across time.
We show that our novel training objective yields state-of-the-art results for conditional and unconditional generation in CIFAR-10 and baseline improvements in AFHQ and FFHQ.
arXiv Detail & Related papers (2023-02-17T18:45:04Z) - Dataset Distillation: A Comprehensive Review [76.26276286545284]
dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z) - Fair Generative Models via Transfer Learning [39.12323728810492]
We propose fairTL, a transfer learning approach to learn fair generative models.
We introduce two additional innovations to improve upon fairTL: (i) multiple feedback and (ii) Linear-Probing followed by Fine-Tuning.
Extensive experiments show that fairTL and fairTL++ achieve state-of-the-art in both quality and fairness of generated samples.
arXiv Detail & Related papers (2022-12-02T01:44:38Z) - Robust and Resource-Efficient Data-Free Knowledge Distillation by Generative Pseudo Replay [5.3330804968579795]
Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data.
Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process.
However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy.
This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data.
We propose to model the distribution of the previously observed synthetic samples
arXiv Detail & Related papers (2022-01-09T14:14:28Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z) - Learning Energy-Based Models by Diffusion Recovery Likelihood [61.069760183331745]
We present a diffusion recovery likelihood method to tractably learn and sample from a sequence of energy-based models.
After training, synthesized images can be generated by the sampling process that initializes from Gaussian white noise distribution.
On unconditional CIFAR-10 our method achieves FID 9.58 and inception score 8.30, superior to the majority of GANs.
arXiv Detail & Related papers (2020-12-15T07:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.