Generating Higher-Fidelity Synthetic Datasets with Privacy Guarantees
- URL: http://arxiv.org/abs/2003.00997v1
- Date: Mon, 2 Mar 2020 16:23:41 GMT
- Title: Generating Higher-Fidelity Synthetic Datasets with Privacy Guarantees
- Authors: Aleksei Triastcyn, Boi Faltings
- Abstract summary: We consider the problem of enhancing user privacy in common machine learning development tasks, such as data annotation and inspection.
We propose employing Bayesian differential privacy as the means to achieve a rigorous theoretical guarantee while providing a better privacy-utility trade-off.
- Score: 34.01962235805095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers the problem of enhancing user privacy in common machine
learning development tasks, such as data annotation and inspection, by
substituting the real data with samples from a generative adversarial network.
We propose employing Bayesian differential privacy as the means to achieve a
rigorous theoretical guarantee while providing a better privacy-utility
trade-off. We demonstrate experimentally that our approach produces
higher-fidelity samples compared to prior work, allowing us to (1) detect more
subtle data errors and biases, and (2) reduce the need for real data labelling
by achieving high accuracy when training directly on artificial samples.
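The generator in this setting is trained under differential privacy, typically by clipping per-example gradients and adding Gaussian noise before each update; Bayesian differential privacy changes how the privacy cost of these updates is accounted, not the noising mechanism itself. A minimal sketch of that generic noisy-update step (function and parameter names are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_noisy_gradient(per_example_grads, clip_norm=1.0, noise_mult=1.1):
    """Clip each per-example gradient to clip_norm, sum, add Gaussian
    noise scaled to the clipping bound, and average -- the standard
    DP-SGD update pattern used to train private generative models."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

# One noisy update over a small batch of toy per-example gradients.
grads = [rng.normal(size=4) for _ in range(8)]
update = dp_noisy_gradient(grads)
```

Once the generator is trained this way, annotation and inspection can be carried out on its samples instead of the real records.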
Related papers
- DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing [0.8739101659113155]
We introduce an effective data publishing algorithm, DP-CDA.
Our proposed algorithm generates synthetic datasets by randomly mixing data in a class-specific manner, and inducing carefully-tuned randomness to ensure privacy guarantees.
Our results indicate that synthetic datasets produced using the DP-CDA can achieve superior utility compared to those generated by traditional data publishing algorithms, even when subject to the same privacy requirements.
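One plausible reading of "randomly mixing data in a class-specific manner with carefully-tuned randomness" is averaging a few same-class samples and perturbing the result with noise. A hedged sketch of that idea (all names and parameters here are illustrative assumptions, not the paper's actual algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_within_class(X, y, cls, k=3, noise_std=0.1):
    """Average k randomly chosen samples of one class and add
    Gaussian noise, producing a single synthetic record."""
    idx = np.flatnonzero(y == cls)
    chosen = rng.choice(idx, size=k, replace=False)
    mixed = X[chosen].mean(axis=0)
    return mixed + rng.normal(0.0, noise_std, size=mixed.shape)

# Toy dataset: 20 samples, 5 features, two alternating classes.
X = rng.normal(size=(20, 5))
y = np.array([0, 1] * 10)
synthetic = mix_within_class(X, y, cls=0)
```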
arXiv Detail & Related papers (2024-11-25T06:14:06Z) - Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models -- A review and challenges for practice [0.3069335774032178]
It is crucial to empirically assess the privacy risks associated with the generated synthetic data before deploying generative technologies.
This paper outlines the key concepts and assumptions underlying empirical privacy evaluation in machine learning-based generative and predictive models.
arXiv Detail & Related papers (2024-11-19T12:19:28Z) - Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning [59.29849532966454]
We propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data in a privacy-preserving manner.
Our method achieves over 20% improvements in forgetting error compared to the state-of-the-art.
arXiv Detail & Related papers (2024-11-04T21:27:06Z) - Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning [3.29354893777827]
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
arXiv Detail & Related papers (2023-07-04T18:37:11Z) - On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs.
Experimental results show that our method achieves competitive detection performance on various text classification tasks.
arXiv Detail & Related papers (2023-06-27T02:54:07Z) - Towards Generalizable Data Protection With Transferable Unlearnable Examples [50.628011208660645]
We present a novel, generalizable data protection method by generating transferable unlearnable examples.
To the best of our knowledge, this is the first solution that examines data privacy from the perspective of data distribution.
arXiv Detail & Related papers (2023-05-18T04:17:01Z) - Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z) - Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
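The NDA objective described above adds negatively augmented samples (e.g. jigsaw-shuffled or otherwise corrupted images) as an extra "fake" term in the discriminator loss. A sketch of the idea under that assumption, using the standard non-saturating loss; the weighting parameter `lam` is illustrative, not the paper's exact objective:

```python
import numpy as np

def nda_discriminator_loss(d_real, d_fake, d_nda, lam=0.5):
    """Non-saturating discriminator loss with negatively augmented
    samples scored as an additional fake term weighted by lam.
    Inputs are discriminator outputs in (0, 1)."""
    eps = 1e-12  # guard against log(0)
    real_term = -np.mean(np.log(d_real + eps))
    fake_term = -np.mean(np.log(1.0 - d_fake + eps))
    nda_term = -np.mean(np.log(1.0 - d_nda + eps))
    return real_term + fake_term + lam * nda_term

# Toy discriminator scores for real, generated, and NDA samples.
loss = nda_discriminator_loss(
    d_real=np.array([0.9, 0.8]),
    d_fake=np.array([0.2, 0.1]),
    d_nda=np.array([0.3, 0.2]),
)
```

Penalizing the NDA term pushes the discriminator to reject off-support samples, which concentrates the generator on the true data distribution.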
arXiv Detail & Related papers (2021-02-09T20:28:35Z) - P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model [23.91327154831855]
This paper proposes privacy-preserving phased generative model (P3GM) for releasing sensitive data.
P3GM employs a two-phase learning process to make it robust against noise and to increase learning efficiency.
Compared with the state-of-the-art methods, our generated samples contain less noise and are closer to the original data in terms of data diversity.
arXiv Detail & Related papers (2020-06-22T09:47:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.