Ambient Diffusion: Learning Clean Distributions from Corrupted Data
- URL: http://arxiv.org/abs/2305.19256v1
- Date: Tue, 30 May 2023 17:43:33 GMT
- Title: Ambient Diffusion: Learning Clean Distributions from Corrupted Data
- Authors: Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alexandros G. Dimakis, Adam Klivans
- Abstract summary: We present the first diffusion-based framework that can learn an unknown distribution using only highly-corrupted samples.
Another benefit of our approach is the ability to train generative models that are less likely to memorize individual training samples.
- Score: 77.34772355241901
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first diffusion-based framework that can learn an unknown
distribution using only highly-corrupted samples. This problem arises in
scientific applications where access to uncorrupted samples is impossible or
expensive to acquire. Another benefit of our approach is the ability to train
generative models that are less likely to memorize individual training samples
since they never observe clean training data. Our main idea is to introduce
additional measurement distortion during the diffusion process and require the
model to predict the original corrupted image from the further corrupted image.
We prove that our method leads to models that learn the conditional expectation
of the full uncorrupted image given this additional measurement corruption.
This holds for any corruption process that satisfies some technical conditions
(and in particular includes inpainting and compressed sensing). We train models
on standard benchmarks (CelebA, CIFAR-10 and AFHQ) and show that we can learn
the distribution even when all the training samples have $90\%$ of their pixels
missing. We also show that we can finetune foundation models on small corrupted
datasets (e.g. MRI scans with block corruptions) and learn the clean
distribution without memorizing the training set.
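
To make the training objective concrete, here is a minimal PyTorch-style sketch of one training step for the inpainting case. The function and argument names, the mask-conditioning of the model, and the further-corruption rate are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def ambient_diffusion_loss(model, x_corrupted, mask_A, sigma_t, extra_drop=0.1):
    """One Ambient Diffusion training step for inpainting (illustrative sketch).

    x_corrupted: training image with unobserved pixels zeroed out, (B, C, H, W)
    mask_A:      binary mask of observed pixels (1 = observed)
    sigma_t:     noise level at the sampled diffusion time
    extra_drop:  probability of hiding an *additional* observed pixel
    """
    # Further corrupt: drop extra pixels so the new mask is a subset of mask_A.
    keep = (torch.rand_like(mask_A) > extra_drop).float()
    mask_A_tilde = mask_A * keep

    # Diffuse the further-corrupted image.
    x_t = mask_A_tilde * x_corrupted + sigma_t * torch.randn_like(x_corrupted)

    # Ask the model to restore the image from the further-corrupted input.
    x0_pred = model(x_t, mask_A_tilde, sigma_t)

    # Supervise only on pixels observed under the ORIGINAL corruption: the
    # target is available even though no clean image ever is, and the pixels
    # hidden by the extra corruption force the model to genuinely inpaint.
    loss = ((mask_A * (x0_pred - x_corrupted)) ** 2).mean()
    return loss
```

As the abstract states, the minimizer of this loss is the conditional expectation of the full uncorrupted image given the further-corrupted observation, which is what lets the trained model recover the clean distribution.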
Related papers
- Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets [20.11176801612364]
Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement.
A proposed remedy is to train exclusively on noisy versions of data with potential copyright issues, so the model never observes the original content.
We show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable.
arXiv Detail & Related papers (2025-02-08T04:37:39Z)
- Patch-Based Diffusion Models Beat Whole-Image Models for Mismatched Distribution Inverse Problems [12.5216516851131]
We study out-of-distribution (OOD) problems where a known training distribution is first provided.
We use a patch-based diffusion prior that learns the image distribution solely from patches.
In both settings, the patch-based method can obtain high quality image reconstructions that can outperform whole-image models.
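
A minimal sketch of how a patch-based prior can be trained: sample random crops and condition the denoiser on normalized patch coordinates so the prior retains positional information. The coordinate-channel trick and all names here are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import torch

def sample_patches(images, patch=64):
    """Random crops plus normalized (row, col) coordinates (illustrative)."""
    B, C, H, W = images.shape
    ys = torch.randint(0, H - patch + 1, (B,))
    xs = torch.randint(0, W - patch + 1, (B,))
    crops = torch.stack([
        images[i, :, ys[i]:ys[i] + patch, xs[i]:xs[i] + patch] for i in range(B)
    ])
    coords = torch.stack([ys.float() / H, xs.float() / W], dim=1)
    # A patch denoiser is then trained on (crops, coords) pairs; at test time,
    # patch scores are aggregated over overlapping crops of the full image.
    return crops, coords
```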
arXiv Detail & Related papers (2024-10-15T16:02:08Z)
- CPSample: Classifier Protected Sampling for Guarding Training Data During Diffusion [58.64822817224639]
Diffusion models have a tendency to exactly replicate their training data, especially when trained on small datasets.
We present CPSample, a method that modifies the sampling process to prevent training data replication while preserving image quality.
CPSample achieves FID scores of 4.97 and 2.97 on CIFAR-10 and CelebA-64, respectively, without producing exact replicates of the training data.
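
One way such protection can be implemented is as guided sampling: a detector scores how close the current iterate is to training data, and each step descends that score. This is a generic sketch under our own assumptions (the `detector` and the update rule are ours), not CPSample's exact algorithm.

```python
import torch

def protected_step(x_t, denoise_fn, detector, guard=0.1):
    """One sampling step steered away from training-data replicas (sketch).

    detector(x):   differentiable score, high when x is near a training image.
    denoise_fn(x): the sampler's ordinary next-iterate update.
    """
    x_t = x_t.detach().requires_grad_(True)
    closeness = detector(x_t).sum()
    grad_away = torch.autograd.grad(closeness, x_t)[0]
    x_next = denoise_fn(x_t).detach()      # unprotected update
    return x_next - guard * grad_away      # nudge away from memorized regions
```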
arXiv Detail & Related papers (2024-09-11T05:42:01Z)
- Improved Distribution Matching Distillation for Fast Image Synthesis [54.72356560597428]
We introduce DMD2, a set of techniques that lift the limitations of Distribution Matching Distillation (DMD) and improve its training.
First, we eliminate the regression loss and the need for expensive dataset construction.
Second, we integrate a GAN loss into the distillation procedure, discriminating between generated samples and real images.
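
The GAN term can be written as a standard non-saturating loss on top of the one-step generator's outputs. This is a generic sketch of that component only, with our own names; DMD2's full objective also includes the distribution-matching gradient.

```python
import torch.nn.functional as F

def gan_losses(discriminator, real_images, fake_images):
    """Non-saturating GAN losses added to the distillation objective (sketch)."""
    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images.detach())
    # Discriminator learns to tell real data from the distilled generator's samples.
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    # Generator is rewarded for fooling the discriminator.
    g_loss = F.softplus(-discriminator(fake_images)).mean()
    return d_loss, g_loss
```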
arXiv Detail & Related papers (2024-05-23T17:59:49Z)
- Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data [74.2507346810066]
Ambient diffusion is a recently proposed framework for training diffusion models using corrupted data.
We present the first framework for training diffusion models that provably sample from the uncorrupted distribution given only noisy training data.
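
The "Tweedie" in the title refers to Tweedie's formula, which connects the posterior mean of the clean image to the score of the noisy marginal. For Gaussian noising $x_t = x_0 + \sigma_t \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$:

$\mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)$

This identity is what allows a model fit only to noisy observations to be converted into an exact estimate of the clean posterior mean, the basis of the "provably sample from the uncorrupted distribution" claim.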
arXiv Detail & Related papers (2024-03-20T14:22:12Z)
- Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models trained on Corrupted Data [56.81246107125692]
Ambient Diffusion Posterior Sampling (A-DPS) leverages a generative model pre-trained on one type of corruption to solve inverse problems.
We show that A-DPS can sometimes outperform models trained on clean data for several image restoration tasks in both speed and performance.
We extend the Ambient Diffusion framework to train MRI models with access only to Fourier subsampled multi-coil MRI measurements.
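
Posterior sampling with a diffusion prior typically interleaves the sampler's ordinary update with a measurement-consistency gradient. The sketch below is a generic DPS-style correction with our own names; A-DPS differs in that the prior itself is trained on corrupted data.

```python
import torch

def dps_guidance(x_t, x_next_uncond, denoise_fn, forward_op, y_meas, zeta=1.0):
    """Add a data-consistency correction to an unconditional sampler step (sketch).

    forward_op: known measurement operator A (e.g. Fourier subsampling for MRI)
    y_meas:     observed measurements, y ≈ A(x0)
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoise_fn(x_t)                      # posterior-mean estimate of x0
    residual = ((forward_op(x0_hat) - y_meas) ** 2).sum()
    grad = torch.autograd.grad(residual, x_t)[0]  # pulls A(x0_hat) toward y
    return x_next_uncond - zeta * grad
```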
arXiv Detail & Related papers (2024-03-13T17:28:20Z)
- Masked Diffusion Models Are Fast Distribution Learners [32.485235866596064]
Diffusion models are commonly trained to learn all fine-grained visual information from scratch.
We show that it suffices to first pre-train the model to learn a primer distribution of the data.
Then the pre-trained model can be fine-tuned for various generation tasks efficiently.
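
The two-stage recipe might look like the sketch below: pre-train on heavily masked inputs (a cheap "primer" distribution of visible pixels), then fine-tune with the ordinary denoising objective. The per-pixel masking and the 0.75 ratio are our own illustrative assumptions, not the paper's exact scheme.

```python
import torch

def masked_pretrain_loss(model, x0, sigma_t, mask_ratio=0.75):
    """Stage 1: learn to denoise only the visible fraction of each image (sketch)."""
    visible = (torch.rand_like(x0[:, :1]) > mask_ratio).float()
    x_t = visible * x0 + sigma_t * torch.randn_like(x0)
    return ((visible * (model(x_t, sigma_t) - x0)) ** 2).mean()

def finetune_loss(model, x0, sigma_t):
    """Stage 2: standard denoising objective on full images."""
    x_t = x0 + sigma_t * torch.randn_like(x0)
    return ((model(x_t, sigma_t) - x0) ** 2).mean()
```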
arXiv Detail & Related papers (2023-06-20T08:02:59Z)
- Soft Diffusion: Score Matching for General Corruptions [84.26037497404195]
We propose a new objective called Soft Score Matching that provably learns the score function for any linear corruption process.
We show that our objective learns the gradient of the likelihood under suitable regularity conditions for the family of corruption processes.
Our method achieves state-of-the-art FID score $1.85$ on CelebA-64, outperforming all previous linear diffusion models.
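
Concretely, for a linear corruption process $y_t = C_t x_0 + s_t \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, one reading of the Soft Score Matching objective consistent with the abstract is to penalize the restoration error only after passing it back through the corruption operator:

$J(\theta) = \mathbb{E}_{t, x_0, \varepsilon} \left\| C_t \big( h_\theta(y_t, t) - x_0 \big) \right\|^2$

Since $\nabla_{y_t} \log p(y_t \mid x_0) = (C_t x_0 - y_t)/s_t^2$ depends on $x_0$ only through $C_t x_0$, matching $C_t h_\theta$ to $C_t x_0$ suffices to recover the score of $y_t$.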
arXiv Detail & Related papers (2022-09-12T17:45:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.