Differentially Private Diffusion Models Generate Useful Synthetic Images
- URL: http://arxiv.org/abs/2302.13861v1
- Date: Mon, 27 Feb 2023 15:02:04 GMT
- Title: Differentially Private Diffusion Models Generate Useful Synthetic Images
- Authors: Sahra Ghalebikesabi, Leonard Berrada, Sven Gowal, Ira Ktena, Robert
Stanforth, Jamie Hayes, Soham De, Samuel L. Smith, Olivia Wiles, Borja Balle
- Abstract summary: Recent studies have found that, by default, the outputs of some diffusion models do not preserve training data privacy.
By privately fine-tuning ImageNet pre-trained diffusion models with more than 80M parameters, we obtain SOTA results on CIFAR-10 and Camelyon17.
Our results demonstrate that diffusion models fine-tuned with differential privacy can produce useful and provably private synthetic data.
- Score: 53.94025967603649
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to generate privacy-preserving synthetic versions of sensitive
image datasets could unlock numerous ML applications currently constrained by
data availability. Due to their astonishing image generation quality, diffusion
models are a prime candidate for generating high-quality synthetic data.
However, recent studies have found that, by default, the outputs of some
diffusion models do not preserve training data privacy. By privately
fine-tuning ImageNet pre-trained diffusion models with more than 80M
parameters, we obtain SOTA results on CIFAR-10 and Camelyon17 in terms of both
FID and the accuracy of downstream classifiers trained on synthetic data. We
decrease the SOTA FID on CIFAR-10 from 26.2 to 9.8, and increase the accuracy
from 51.0% to 88.0%. On synthetic data from Camelyon17, we achieve a downstream
accuracy of 91.1% which is close to the SOTA of 96.5% when training on the real
data. We leverage the ability of generative models to create infinite amounts
of data to maximise the downstream prediction performance, and further show how
to use synthetic data for hyperparameter tuning. Our results demonstrate that
diffusion models fine-tuned with differential privacy can produce useful and
provably private synthetic data, even in applications with significant
distribution shift between the pre-training and fine-tuning distributions.
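A concrete illustration of the core recipe, not the authors' released code: the sketch below fine-tunes a toy denoiser on a private dataset with DP-SGD via Opacus. The architecture, dataset, noise schedule, and privacy budget (epsilon = 10, delta = 1e-5) are placeholder assumptions; the paper instead fine-tunes ImageNet pre-trained diffusion models with more than 80M parameters on CIFAR-10 and Camelyon17.
```python
# Hedged sketch: DP-SGD fine-tuning of a diffusion-style denoiser with Opacus.
# Everything below (model, data, schedule, budget) is an illustrative stand-in.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy denoiser standing in for a pre-trained diffusion UNet (a real denoiser
# would also condition on the timestep t; omitted here for brevity).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.SiLU(),
    torch.nn.Conv2d(32, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy "private" images standing in for CIFAR-10 / Camelyon17.
loader = DataLoader(TensorDataset(torch.randn(256, 3, 32, 32)), batch_size=64)

# Standard DDPM forward-noising schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Wrap model/optimizer/loader so per-sample gradients are clipped and noised.
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    epochs=10,
    target_epsilon=10.0,   # assumed budget, not taken from the paper
    target_delta=1e-5,
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

for epoch in range(10):
    for (x0,) in loader:
        # Noise-prediction objective on the private data.
        t = torch.randint(0, T, (x0.size(0),))
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
        loss = F.mse_loss(model(x_t), noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print("spent epsilon:", privacy_engine.get_epsilon(delta=1e-5))
```
Samples drawn from the fine-tuned model can then be used to train downstream classifiers and, as the abstract notes, to tune their hyperparameters without touching the real data again.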
Related papers
- Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory [8.713796223707398]
We use random matrix theory to derive the performance of a binary classifier trained on a mix of real and synthetic data.
Our findings identify conditions where synthetic data could improve performance, focusing on the quality of the generative model and verification strategy.
arXiv Detail & Related papers (2024-10-11T16:09:27Z)
- Self-Improving Diffusion Models with Synthetic Data [12.597035060380001]
Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models.
SIMS uses self-synthesized data to provide negative guidance during the generation process.
It is the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD.
arXiv Detail & Related papers (2024-08-29T08:12:18Z)
- Efficient Differentially Private Fine-Tuning of Diffusion Models [15.71777343534365]
Fine-tuning large diffusion models with DP-SGD can be very resource-demanding in terms of memory usage and computation.
In this work, we investigate Parameter-Efficient Fine-Tuning (PEFT) of diffusion models using Low-Dimensional Adaptation (LoDA) with Differential Privacy.
Our source code will be made available on GitHub.
arXiv Detail & Related papers (2024-06-07T21:00:20Z)
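The entry above names Low-Dimensional Adaptation (LoDA), whose details are not given in this summary. The following is therefore only a generic LoRA-style low-rank adapter sketch of the parameter-efficient idea: freeze the pre-trained weights and train (and privatize) only a small low-rank correction.
```python
# Hedged sketch of a generic low-rank adapter (not LoDA itself): the frozen base
# weight is augmented with a trainable low-rank update B @ A, so only a small
# number of parameters needs to be clipped and noised under DP-SGD.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankAdapterLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank trainable correction.
        return self.base(x) + self.scale * F.linear(x, self.B @ self.A)

# Usage: wrap selected layers of a pre-trained denoiser, then pass only the
# adapter parameters (A, B) to a DP-SGD optimizer such as the one sketched above.
layer = LowRankAdapterLinear(nn.Linear(512, 512), rank=8)
trainable = [p for p in layer.parameters() if p.requires_grad]
```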
- Training Class-Imbalanced Diffusion Model Via Overlap Optimization [55.96820607533968]
Diffusion models trained on real-world datasets often yield inferior fidelity for tail classes.
Deep generative models, including diffusion models, are biased towards classes with abundant training images.
We propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
arXiv Detail & Related papers (2024-02-16T16:47:21Z)
- PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining [13.823621924706348]
Differential Privacy (DP) image data synthesis allows organizations to share and utilize synthetic images without privacy concerns.
Previous methods combine advanced generative models with pre-training on a public dataset to produce high-quality DP image data.
This paper proposes a novel DP image synthesis method, termed PRIVIMAGE, which meticulously selects pre-training data.
arXiv Detail & Related papers (2023-10-19T14:04:53Z)
- On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
- Transitioning from Real to Synthetic data: Quantifying the bias in model [1.6134566438137665]
This study aims to establish a trade-off between bias and fairness in the models trained using synthetic data.
We demonstrate that there exist varying levels of bias impact on models trained using synthetic data.
arXiv Detail & Related papers (2021-05-10T06:57:14Z)
- Denoising Diffusion Probabilistic Models [91.94962645056896]
We present high quality image synthesis results using diffusion probabilistic models.
Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics.
arXiv Detail & Related papers (2020-06-19T17:24:44Z)
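For context, the weighted variational bound referenced in this entry is usually trained in its simplified form as a noise-prediction objective; in standard notation (not quoted from this listing):
\[
L_{\mathrm{simple}}(\theta) \;=\; \mathbb{E}_{t,\,x_0,\,\epsilon\sim\mathcal{N}(0,I)}\Big[\big\|\,\epsilon-\epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,\;t\big)\big\|^2\Big],
\qquad \bar{\alpha}_t=\prod_{s=1}^{t}(1-\beta_s),
\]
where \(\beta_1,\dots,\beta_T\) is the forward-process variance schedule and \(\epsilon_\theta\) is the learned noise predictor; see the DDPM paper for the exact weighting.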