Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing
- URL: http://arxiv.org/abs/2306.10308v2
- Date: Thu, 21 Sep 2023 09:17:16 GMT
- Title: Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing
- Authors: Matthieu Meeus, Florent Guépin, Ana-Maria Cretu and Yves-Alexandre de Montjoye
- Abstract summary: We propose a principled vulnerable record identification technique for synthetic data publishing.
We show it to strongly outperform previous ad-hoc methods across datasets and generators.
We show it to accurately identify vulnerable records when synthetic data generators are made differentially private.
- Score: 9.061271587514215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic data is seen as the most promising solution to share
individual-level data while preserving privacy. Shadow modeling-based
Membership Inference Attacks (MIAs) have become the standard approach to
evaluate the privacy risk of synthetic data. While very effective, they require
a large number of datasets to be created and models trained to evaluate the
risk posed by a single record. The privacy risk of a dataset is thus currently
evaluated by running MIAs on a handful of records selected using ad-hoc
methods. We here propose what is, to the best of our knowledge, the first
principled vulnerable record identification technique for synthetic data
publishing, leveraging the distance to a record's closest neighbors. We show
our method to strongly outperform previous ad-hoc methods across datasets and
generators. We also show our method to be robust to the choice of MIA and to
the specific choice of parameters. Finally, we show it to accurately
identify vulnerable records when synthetic data generators are made
differentially private. The choice of vulnerable records is as important as
more accurate MIAs when evaluating the privacy of synthetic data releases,
including from a legal perspective. We here propose a simple yet highly
effective method to select them. We hope our method will enable practitioners
to better estimate the risk posed by synthetic data publishing and researchers
to fairly compare ever-improving MIAs on synthetic data.
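The method's core idea, ranking records by the distance to their closest neighbors, lends itself to a short sketch. The following is a minimal illustration under assumed choices, not the authors' implementation: the Euclidean metric, k = 5, and the standardization step are all placeholders.

```python
# Minimal sketch: rank records by how isolated they are, using the mean
# distance to their k closest neighbors. Records far from everyone else are
# the candidate "Achilles' heels" to target with an MIA.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def rank_vulnerable_records(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Return record indices sorted from most to least isolated.

    X is a (n_records, n_features) numeric array; categorical columns are
    assumed to be encoded beforehand.
    """
    X_scaled = StandardScaler().fit_transform(X)   # put features on one scale
    # k + 1 neighbors because each record is its own closest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_scaled)
    distances, _ = nn.kneighbors(X_scaled)
    isolation = distances[:, 1:].mean(axis=1)      # drop the self-distance
    return np.argsort(isolation)[::-1]             # most isolated first

# Usage: pick the top-ranked records as MIA targets instead of ad-hoc picks.
# targets = rank_vulnerable_records(X, k=5)[:10]   # X: hypothetical dataset
```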
Related papers
- Defining 'Good': Evaluation Framework for Synthetic Smart Meter Data [14.779917834583577] (2024-07-16)
We show that standard privacy attack methods are inadequate for assessing privacy risks of smart meter datasets.
We propose an improved method: inject implausible outliers into the training data, then launch privacy attacks directly on these outliers (a minimal sketch of this idea appears after this list).
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186] (2024-06-20)
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
- Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown [50.40020716418472] (2024-01-31)
This study conducts a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity.
Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated.
- The Inadequacy of Similarity-based Privacy Metrics: Privacy Attacks against "Truly Anonymous" Synthetic Datasets [12.730435519914415] (2023-12-08)
We examine the privacy metrics used in real-world synthetic data deployments and demonstrate their unreliability in several ways.
We introduce ReconSyn, a reconstruction attack that generates multiple synthetic datasets that are considered private by the metrics but actually leak information unique to individual records.
We show that ReconSyn recovers 78-100% of the outliers in the training data with only black-box access to a single fitted generative model and the privacy metrics.
- Partition-based differentially private synthetic data generation [0.5095097384893414] (2023-10-10)
We present a partition-based approach that reduces errors and improves the quality of synthetic data, even with a limited privacy budget.
The synthetic data produced using our approach exhibits improved quality and utility, making it a preferable choice for private synthetic data sharing.
- SoK: Privacy-Preserving Data Synthesis [72.92263073534899] (2023-07-05)
This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field.
We put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods.
- Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning [3.29354893777827] (2023-07-04)
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995] (2023-02-24)
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that infers membership by targeting local overfitting of the generative model (see the density-ratio sketch after this list).
- Private Set Generation with Discriminative Information [63.851085173614] (2022-11-07)
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle with the utility of their synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
- No Free Lunch in "Privacy for Free: How does Dataset Condensation Help Privacy" [75.98836424725437] (2022-09-29)
New methods designed to preserve data privacy require careful scrutiny.
Failure to preserve privacy is hard to detect, and yet can lead to catastrophic results when a system implementing a "privacy-preserving" method is attacked.
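As referenced in the 'Defining Good' entry, here is a minimal sketch of the outlier-injection idea. Everything in it is a stand-in: the toy Gaussian "load profiles", the GaussianMixture model playing the role of a real smart-meter generator, and the distance-to-closest-synthetic-record check used to probe whether the planted canaries leaked.

```python
# Minimal sketch: plant implausible outlier records ("canaries") in the
# training data, fit a generator, then check whether synthetic records land
# suspiciously close to the canaries, which would indicate memorization.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real = rng.normal(loc=0.5, scale=0.1, size=(2000, 8))    # plausible profiles
canaries = rng.normal(loc=5.0, scale=0.1, size=(20, 8))  # implausible outliers
train = np.vstack([real, canaries])

# GaussianMixture stands in for a real smart-meter data generator.
gm = GaussianMixture(n_components=10, random_state=0).fit(train)
synthetic, _ = gm.sample(2000)

# Distance from each canary to its closest synthetic record: small values
# suggest the generator reproduced the canary, i.e. a privacy leak.
diffs = synthetic[None, :, :] - canaries[:, None, :]     # (20, 2000, 8)
dcr = np.linalg.norm(diffs, axis=2).min(axis=1)          # (20,)
print(f"median canary-to-closest-synthetic distance: {np.median(dcr):.3f}")
```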
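And, as referenced in the DOMIAS entry, a minimal sketch of a density-ratio membership score. Gaussian KDEs and the fixed bandwidth are assumptions standing in for the paper's density estimators; calibrating the decision threshold is left out.

```python
# Minimal sketch: score a record as a likely training member when the
# synthetic data assigns it more density than a reference dataset does.
import numpy as np
from sklearn.neighbors import KernelDensity

def membership_scores(targets, synthetic, reference, bandwidth=0.2):
    """Log density ratio log p_syn(x) - log p_ref(x) for each target record."""
    kde_syn = KernelDensity(bandwidth=bandwidth).fit(synthetic)
    kde_ref = KernelDensity(bandwidth=bandwidth).fit(reference)
    # score_samples returns log-densities, so the ratio becomes a difference.
    return kde_syn.score_samples(targets) - kde_ref.score_samples(targets)

# Usage: records scoring above a calibrated threshold are predicted members.
# scores = membership_scores(candidate_records, synthetic_data, reference_data)
```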