Generating Higher-Fidelity Synthetic Datasets with Privacy Guarantees
- URL: http://arxiv.org/abs/2003.00997v1
- Date: Mon, 2 Mar 2020 16:23:41 GMT
- Title: Generating Higher-Fidelity Synthetic Datasets with Privacy Guarantees
- Authors: Aleksei Triastcyn, Boi Faltings
- Abstract summary: We consider the problem of enhancing user privacy in common machine learning development tasks, such as data annotation and inspection.
We propose employing Bayesian differential privacy as the means to achieve a rigorous theoretical guarantee while providing a better privacy-utility trade-off.
- Score: 34.01962235805095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers the problem of enhancing user privacy in common machine
learning development tasks, such as data annotation and inspection, by
substituting the real data with samples from a generative adversarial network.
We propose employing Bayesian differential privacy as the means to achieve a
rigorous theoretical guarantee while providing a better privacy-utility
trade-off. We demonstrate experimentally that our approach produces
higher-fidelity samples than prior work, making it possible to (1) detect more
subtle data errors and biases, and (2) reduce the need for real data labelling
by achieving high accuracy when training directly on artificial samples.
Related papers
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - VFLGAN: Vertical Federated Learning-based Generative Adversarial Network for Vertically Partitioned Data Publication [16.055684281505474]
This article proposes a Vertical Federated Learning-based Generative Adversarial Network, VFLGAN, for vertically partitioned data publication.
The quality of the synthetic dataset generated by VFLGAN is 3.2 times better than that generated by VertiGAN.
We also propose a practical auditing scheme that applies membership inference attacks to estimate privacy leakage through the synthetic dataset.
arXiv Detail & Related papers (2024-04-15T12:25:41Z) - Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving
Training Data Release for Machine Learning [3.29354893777827]
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
arXiv Detail & Related papers (2023-07-04T18:37:11Z) - On the Universal Adversarial Perturbations for Efficient Data-free
Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs.
Experimental results show that our method achieves competitive detection performance on various text classification tasks.
arXiv Detail & Related papers (2023-06-27T02:54:07Z) - Towards Generalizable Data Protection With Transferable Unlearnable
Examples [50.628011208660645]
We present a novel, generalizable data protection method by generating transferable unlearnable examples.
To the best of our knowledge, this is the first solution that examines data privacy from the perspective of data distribution.
arXiv Detail & Related papers (2023-05-18T04:17:01Z) - Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models are struggling with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z) - Synthetic Text Generation with Differential Privacy: A Simple and
Practical Recipe [32.63295550058343]
We show that a simple and practical recipe in the text domain is effective in generating useful synthetic text with strong privacy protection.
Our method produces synthetic text that is competitive in terms of utility with its non-private counterpart.
arXiv Detail & Related papers (2022-10-25T21:21:17Z) - Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z) - RDP-GAN: A Rényi-Differential Privacy based Generative Adversarial
Network [75.81653258081435]
Generative adversarial network (GAN) has attracted increasing attention recently owing to its impressive ability to generate realistic samples with high privacy protection.
However, when GANs are applied to sensitive or private training examples, such as medical or financial records, they may still divulge individuals' sensitive and private information.
We propose a Rényi-differentially private GAN (RDP-GAN), which achieves differential privacy (DP) in a GAN by carefully adding random noise to the value of the loss function during training.
arXiv Detail & Related papers (2020-07-04T09:51:02Z) - P3GM: Private High-Dimensional Data Release via Privacy Preserving
Phased Generative Model [23.91327154831855]
This paper proposes a privacy-preserving phased generative model (P3GM) for releasing sensitive data.
P3GM employs a two-phase learning process to make it robust against noise and to increase learning efficiency.
Compared with state-of-the-art methods, our generated samples are less noisy and closer to the original data in terms of data diversity.
arXiv Detail & Related papers (2020-06-22T09:47:54Z)