Synthetic Dataset Generation for Privacy-Preserving Machine Learning
- URL: http://arxiv.org/abs/2210.03205v2
- Date: Mon, 10 Oct 2022 17:20:04 GMT
- Title: Synthetic Dataset Generation for Privacy-Preserving Machine Learning
- Authors: Efstathia Soufleri, Gobinda Saha, Kaushik Roy
- Abstract summary: We propose a method to generate secure synthetic datasets from the original private datasets.
We show that our proposed method preserves data-privacy under various privacy-leakage attacks.
- Score: 7.489265323050362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Learning (ML) has achieved enormous success in solving a variety of
problems in computer vision, speech recognition, and object detection, to name a
few. The principal reason for this success is the availability of huge datasets
for training deep neural networks (DNNs). However, datasets cannot be publicly
released if they contain sensitive information such as medical records, and
data privacy becomes a major concern. Encryption methods could be a possible
solution; however, their deployment in ML applications seriously impacts
classification accuracy and results in substantial computational overhead.
Alternatively, obfuscation techniques could be used, but maintaining a good
trade-off between visual privacy and accuracy is challenging. In this paper, we
propose a method to generate secure synthetic datasets from the original
private datasets. Given a network with Batch Normalization (BN) layers
pretrained on the original dataset, we first record the class-wise BN layer
statistics. Next, we generate the synthetic dataset by optimizing random noise
such that the synthetic data match the layer-wise statistical distribution of
original images. We evaluate our method on image classification datasets
(CIFAR10, ImageNet) and show that synthetic data can be used in place of the
original CIFAR10/ImageNet data for training networks from scratch, producing
comparable classification performance. Further, to analyze visual privacy
provided by our method, we use Image Quality Metrics and show a high degree of
visual dissimilarity between the original and synthetic images. Moreover, we
show that our proposed method preserves data-privacy under various
privacy-leakage attacks including Gradient Matching Attack, Model Memorization
Attack, and GAN-based Attack.
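The synthesis procedure lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch version of the core loop, in the spirit of data-free synthesis methods built on BN-statistics matching: forward hooks measure how far the batch statistics of the synthetic inputs are from each BN layer's stored statistics, and an optimizer updates the noise images until they match. Note that the paper records class-wise BN statistics; for brevity this sketch matches the layers' running (dataset-level) statistics, and the image size, step count, and loss weights are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
from torchvision import models

class BNStatHook:
    """Forward hook: distance between the batch statistics at a BN layer's
    input and that layer's stored running statistics."""
    def __init__(self, bn: nn.BatchNorm2d):
        self.loss = torch.tensor(0.0)
        self.handle = bn.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        x = inputs[0]
        mean = x.mean(dim=(0, 2, 3))
        var = x.var(dim=(0, 2, 3), unbiased=False)
        self.loss = ((mean - module.running_mean) ** 2).sum() \
                  + ((var - module.running_var) ** 2).sum()

def synthesize(model, labels, steps=2000, lr=0.05, bn_weight=1.0):
    """Optimize random noise so its layer-wise statistics match the model's."""
    model.eval()  # use stored running stats; do not update them
    hooks = [BNStatHook(m) for m in model.modules()
             if isinstance(m, nn.BatchNorm2d)]
    x = torch.randn(len(labels), 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        out = model(x)  # hooks populate per-layer statistic losses
        loss = ce(out, labels) + bn_weight * sum(h.loss for h in hooks)
        loss.backward()
        opt.step()
    for h in hooks:
        h.handle.remove()
    return x.detach()

# Usage: synthesize 16 images of class 0 from a pretrained ResNet-18.
# net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# synthetic = synthesize(net, torch.zeros(16, dtype=torch.long))
```

In the paper's setting, the resulting images stand in for CIFAR10/ImageNet when training networks from scratch; loops of this kind are commonly augmented with image priors (e.g., total-variation or L2 penalties on x) to improve visual quality.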
Related papers
- DataDream: Few-shot Guided Dataset Generation [90.09164461462365]
We propose a framework for synthesizing classification datasets that more faithfully represents the real data distribution.
DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model.
We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets.
arXiv Detail & Related papers (2024-07-15T17:10:31Z)
- Federated Face Forgery Detection Learning with Personalized Representation [63.90408023506508]
Deep generator technology can produce high-quality fake videos that are indistinguishable, posing a serious social threat.
Traditional forgery detection methods rely on directly centralizing the training data.
The paper proposes a novel federated face forgery detection learning framework with personalized representation.
arXiv Detail & Related papers (2024-06-17T02:20:30Z)
- Integrating kNN with Foundation Models for Adaptable and Privacy-Aware Image Classification [0.13108652488669734]
Traditional deep learning models implicitly encode knowledge, limiting their transparency and ability to adapt to data changes.
We address this limitation by storing embeddings of the underlying training data independently of the model weights.
Our approach integrates the $k$-Nearest Neighbor ($k$-NN) classifier with a vision-based foundation model pre-trained in a self-supervised manner on natural images (see the sketch below).
arXiv Detail & Related papers (2024-02-19T20:08:13Z)
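A minimal sketch of that retrieval-based design, with a frozen torchvision backbone standing in for the self-supervised foundation model (the paper's actual backbone, distance metric, and datasets are assumptions here). The point is architectural: the labeled embeddings live in an external store, so data can be added or removed without touching the model weights.

```python
import torch
import torch.nn as nn
from torchvision import models

class KNNClassifier:
    """k-NN over stored embeddings; the backbone's weights stay frozen."""
    def __init__(self, backbone: nn.Module, k: int = 5):
        self.backbone = backbone.eval()
        self.k = k
        self.embeddings = None  # (N, D) support embeddings
        self.labels = None      # (N,) support labels

    @torch.no_grad()
    def _embed(self, images):
        z = self.backbone(images)
        return nn.functional.normalize(z, dim=1)  # unit norm -> cosine sim

    def add(self, images, labels):
        # Adding (or deleting) data only edits this store, not the weights.
        z = self._embed(images)
        if self.embeddings is None:
            self.embeddings, self.labels = z, labels
        else:
            self.embeddings = torch.cat([self.embeddings, z])
            self.labels = torch.cat([self.labels, labels])

    def predict(self, images):
        assert self.embeddings is not None, "call add() first"
        sim = self._embed(images) @ self.embeddings.T  # (B, N) similarities
        idx = sim.topk(self.k, dim=1).indices          # k nearest neighbors
        votes = self.labels[idx]                       # (B, k) neighbor labels
        return votes.mode(dim=1).values                # majority vote

# Usage: a ResNet-18 trunk as an illustrative stand-in backbone.
# trunk = nn.Sequential(
#     *list(models.resnet18(weights="DEFAULT").children())[:-1], nn.Flatten())
# clf = KNNClassifier(trunk, k=5)
# clf.add(support_images, support_labels); preds = clf.predict(query_images)
```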
- PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining [13.823621924706348]
Differential Privacy (DP) image data synthesis allows organizations to share and utilize synthetic images without privacy concerns.
Previous methods incorporate the advanced techniques of generative models and pre-training on a public dataset to produce exceptional DP image data.
This paper proposes a novel DP image synthesis method, termed PRIVIMAGE, which meticulously selects pre-training data.
arXiv Detail & Related papers (2023-10-19T14:04:53Z)
- Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning [3.29354893777827]
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
arXiv Detail & Related papers (2023-07-04T18:37:11Z)
- Attribute-preserving Face Dataset Anonymization via Latent Code Optimization [64.4569739006591]
We present a task-agnostic anonymization procedure that directly optimizes the images' latent representation in the latent space of a pre-trained GAN.
We demonstrate through a series of experiments that our method is capable of anonymizing the identity of the images while, crucially, better preserving the facial attributes.
arXiv Detail & Related papers (2023-03-20T17:34:05Z)
- ConfounderGAN: Protecting Image Data Privacy with Causal Confounder [85.6757153033139]
We propose ConfounderGAN, a generative adversarial network (GAN) that can make personal image data unlearnable to protect the data privacy of its owners.
Experiments are conducted on six image classification datasets, consisting of three natural object datasets and three medical datasets.
arXiv Detail & Related papers (2022-12-04T08:49:14Z)
- Content-Aware Differential Privacy with Conditional Invertible Neural Networks [0.7102341019971402]
Invertible Neural Networks (INNs) have shown excellent generative performance while still providing the ability to quantify the exact likelihood.
We hypothesize that adding noise to the latent space of an INN can enable differentially private image modification (a toy sketch follows this entry).
We conduct experiments on publicly available benchmarking datasets as well as dedicated medical ones.
arXiv Detail & Related papers (2022-07-29T11:52:16Z)
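As a toy illustration of that hypothesis, the sketch below uses a single RealNVP-style affine coupling layer as a stand-in for a trained conditional INN: encode the image, perturb the latent with Gaussian noise, and decode exactly. Everything here is illustrative; a real system would train the flow and calibrate the noise scale against a formal DP budget.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """A RealNVP-style affine coupling layer: invertible by construction."""
    def __init__(self, dim: int):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, 64), nn.ReLU(),
            nn.Linear(64, 2 * (dim - self.half)),
        )

    def forward(self, x):  # encode: x -> z
        a, b = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(a).chunk(2, dim=1)
        return torch.cat([a, b * log_s.exp() + t], dim=1)

    def inverse(self, z):  # decode: z -> x, exactly
        a, b = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(a).chunk(2, dim=1)
        return torch.cat([a, (b - t) * (-log_s).exp()], dim=1)

@torch.no_grad()
def privatize(flow: AffineCoupling, x: torch.Tensor, sigma: float = 1.0):
    # Encode, perturb the latent with Gaussian noise, decode back.
    z = flow(x)
    z = z + sigma * torch.randn_like(z)
    return flow.inverse(z)

# Usage on flattened toy data (16-dimensional "images"):
# x_priv = privatize(AffineCoupling(16), torch.rand(8, 16), sigma=0.5)
```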
- Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves zero-shot image recognition accuracy and robustness to image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To keep training on the enlarged dataset tractable, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)