PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining
- URL: http://arxiv.org/abs/2311.12850v4
- Date: Mon, 07 Oct 2024 19:51:47 GMT
- Title: PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining
- Authors: Kecen Li, Chen Gong, Zhixiang Li, Yuzhong Zhao, Xinwen Hou, Tianhao Wang
- Abstract summary: Differential Privacy (DP) image data synthesis allows organizations to share and utilize synthetic images without privacy concerns.
Previous methods incorporate the advanced techniques of generative models and pre-training on a public dataset to produce exceptional DP image data.
This paper proposes a novel DP image synthesis method, termed PRIVIMAGE, which meticulously selects pre-training data.
- Score: 13.823621924706348
- License:
- Abstract: Differential Privacy (DP) image data synthesis leverages the DP technique to generate synthetic data that replaces sensitive data, allowing organizations to share and utilize synthetic images without privacy concerns. Previous methods incorporate the advanced techniques of generative models and pre-training on a public dataset to produce exceptional DP image data, but suffer from problems of unstable training and massive computational resource demands. This paper proposes a novel DP image synthesis method, termed PRIVIMAGE, which meticulously selects pre-training data, promoting the efficient creation of DP datasets with high fidelity and utility. PRIVIMAGE first establishes a semantic query function using a public dataset. Then, this function assists in querying the semantic distribution of the sensitive dataset, facilitating the selection of data from the public dataset with analogous semantics for pre-training. Finally, we pre-train an image generative model using the selected data and then fine-tune this model on the sensitive dataset using Differentially Private Stochastic Gradient Descent (DP-SGD). PRIVIMAGE allows us to train a lightly parameterized generative model, reducing the noise in the gradient during DP-SGD training and enhancing training stability. Extensive experiments demonstrate that PRIVIMAGE uses only 1% of the public dataset for pre-training and 7.6% of the parameters in the generative model compared to the state-of-the-art method, while achieving superior synthetic performance and conserving more computational resources. On average, PRIVIMAGE achieves 30.1% lower FID and 12.6% higher Classification Accuracy than the state-of-the-art method. The replication package and datasets can be accessed online.
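The semantic-aware selection step in the abstract can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: the function name, the use of a noisy label histogram for the "semantic query", and all parameters are hypothetical, and the noise calibration is a simplified Laplace mechanism with sensitivity 1.

```python
import numpy as np

def select_pretraining_indices(sensitive_sem, public_sem, num_classes,
                               epsilon=1.0, budget=100, rng=None):
    """Pick a small public subset whose semantics resemble the sensitive data.

    sensitive_sem / public_sem: per-sample semantic labels predicted by a
    classifier trained on the public dataset (the "semantic query function").
    The label histogram of the sensitive dataset is released with Laplace
    noise so that the query itself satisfies epsilon-DP (sensitivity 1,
    assuming each person contributes one sample).
    """
    rng = np.random.default_rng(rng)
    hist = np.bincount(sensitive_sem, minlength=num_classes).astype(float)
    hist += rng.laplace(scale=1.0 / epsilon, size=num_classes)  # DP noise
    hist = np.clip(hist, 0.0, None)
    total = hist.sum()
    probs = hist / total if total > 0 else np.full(num_classes, 1.0 / num_classes)

    # Draw public samples class-by-class, in proportion to the noisy histogram.
    chosen = []
    for c, p in enumerate(probs):
        pool = np.flatnonzero(public_sem == c)
        k = min(len(pool), int(round(p * budget)))
        chosen.extend(rng.choice(pool, size=k, replace=False))
    return np.array(chosen, dtype=int)
```

In this sketch the privacy cost of data selection is a single noisy-histogram query, which is why it can be charged a small share of the overall budget before DP-SGD fine-tuning.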
Related papers
- Scalable Ranked Preference Optimization for Text-to-Image Generation [76.16285931871948]
We investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training.
The preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process.
We introduce RankDPO to enhance DPO-based methods using the ranking feedback.
arXiv Detail & Related papers (2024-10-23T16:42:56Z)
- DataDream: Few-shot Guided Dataset Generation [90.09164461462365]
We propose a framework for synthesizing classification datasets that more faithfully represents the real data distribution.
DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model.
We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets.
arXiv Detail & Related papers (2024-07-15T17:10:31Z)
- Differentially Private Fine-Tuning of Diffusion Models [22.454127503937883]
The integration of Differential Privacy with diffusion models (DMs) presents a promising yet challenging frontier.
Recent developments in this field have highlighted the potential for generating high-quality synthetic data by pre-training on public data.
We propose a strategy optimized for private diffusion models, which minimizes the number of trainable parameters to enhance the privacy-utility trade-off.
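One way to read "minimizes the number of trainable parameters" is to freeze the backbone and run DP-SGD only on a small head, so per-sample clipping and noising touch far fewer weights. A toy sketch in plain NumPy of one DP-SGD step for a linear head on (hypothetical) frozen features; the function, loss, and hyperparameters are assumptions for illustration, not this paper's method:

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.0, rng=None):
    """One DP-SGD step for a linear head trained on frozen features X.

    Only the head weights w receive gradients; a frozen feature extractor
    (not shown) would produce X. Fewer trainable parameters means the
    Gaussian noise is spread over a smaller gradient vector.
    """
    rng = np.random.default_rng(rng)
    preds = X @ w
    per_grads = (preds - y)[:, None] * X                      # per-example grads, (n, d)
    norms = np.linalg.norm(per_grads, axis=1, keepdims=True)
    per_grads = per_grads / np.maximum(1.0, norms / clip)     # clip each to norm <= clip
    grad = per_grads.sum(axis=0)
    grad += rng.normal(scale=noise_mult * clip, size=grad.shape)  # Gaussian mechanism
    return w - lr * grad / len(X)
```

With `noise_mult=0` this reduces to ordinary clipped SGD; the privacy accounting that maps `noise_mult` and the number of steps to an (epsilon, delta) guarantee is omitted here.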
arXiv Detail & Related papers (2024-06-03T14:18:04Z)
- HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts [49.21764163995419]
We introduce HYPerbolic Entailment filtering (HYPE) to extract meaningful and well-aligned data from noisy image-text pair datasets.
HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark.
This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models.
arXiv Detail & Related papers (2024-04-26T16:19:55Z)
- NaturalInversion: Data-Free Image Synthesis Improving Real-World Consistency [1.1470070927586016]
We introduce NaturalInversion, a novel model inversion-based method to synthesize images that agree well with the original data distribution without using real data.
We show that our images are more consistent with original data distribution than prior works by visualization and additional analysis.
arXiv Detail & Related papers (2023-06-29T03:43:29Z)
- Harnessing large-language models to generate private synthetic text [18.863579044812703]
Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information.
This paper studies an alternative approach: generating synthetic data that is differentially private with respect to the original data, and then non-privately training a model on the synthetic data.
We find that generating private synthetic data is much harder than training a private model.
arXiv Detail & Related papers (2023-06-02T16:59:36Z)
- Differentially Private Diffusion Models Generate Useful Synthetic Images [53.94025967603649]
Recent studies have found that, by default, the outputs of some diffusion models do not preserve training data privacy.
By privately fine-tuning ImageNet pre-trained diffusion models with more than 80M parameters, we obtain SOTA results on CIFAR-10 and Camelyon17.
Our results demonstrate that diffusion models fine-tuned with differential privacy can produce useful and provably private synthetic data.
arXiv Detail & Related papers (2023-02-27T15:02:04Z)
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
- Synthetic Dataset Generation for Privacy-Preserving Machine Learning [7.489265323050362]
We propose a method to generate secure synthetic datasets from the original private datasets.
We show that our proposed method preserves data-privacy under various privacy-leakage attacks.
arXiv Detail & Related papers (2022-10-06T20:54:52Z)
- Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data [17.8055398673228]
We propose an effective and unbiased data synthesizer inspired by the generic characteristics of natural images.
Since our synthesizer only considers the generic properties of natural images, the single model pretrained on our dataset can be consistently transferred to various target datasets.
arXiv Detail & Related papers (2022-04-11T08:51:17Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep the created dataset manageable, we propose to apply a dataset distillation strategy that compresses it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.