Related papers: Synthetic Data -- Anonymisation Groundhog Day

Synthetic Data -- Anonymisation Groundhog Day

URL: http://arxiv.org/abs/2011.07018v6
Date: Mon, 24 Jan 2022 10:32:35 GMT
Title: Synthetic Data -- Anonymisation Groundhog Day
Authors: Theresa Stadler, Bristena Oprisanu, Carmela Troncoso
Abstract summary: We present the first quantitative evaluation of the privacy gain of synthetic data publishing. We show that synthetic data does not prevent inference attacks or does not retain data utility. In contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict.
Score: 4.694549066382216
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Synthetic data has been advertised as a silver-bullet solution to privacy-preserving data publishing that addresses the shortcomings of traditional anonymisation techniques. The promise is that synthetic data drawn from generative models preserves the statistical properties of the original dataset but, at the same time, provides perfect protection against privacy attacks. In this work, we present the first quantitative evaluation of the privacy gain of synthetic data publishing and compare it to that of previous anonymisation techniques. Our evaluation of a wide range of state-of-the-art generative models demonstrates that synthetic data either does not prevent inference attacks or does not retain data utility. In other words, we empirically show that synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymisation techniques. Furthermore, in contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict. Because it is impossible to predict what signals a synthetic dataset will preserve and what information will be lost, synthetic data leads to a highly variable privacy gain and unpredictable utility loss. In summary, we find that synthetic data is far from the holy grail of privacy-preserving data publishing.

Related papers

Differentially Private Synthetic Data Release for Topics API Outputs [63.79476766779742]
We focus on one Privacy-Preserving Ads API: the Topics API, part of Google Chrome's Privacy Sandbox.<n>We generate a differentially-private dataset that closely matches the re-identification risk properties of the real Topics API data.<n>We hope this will enable external researchers to analyze the API in-depth and replicate prior and future work on a realistic large-scale dataset.
arXiv Detail & Related papers (2025-06-30T13:46:57Z)
SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data [13.699107354397286]
We show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss.<n>We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but maintains utility in downstream learning tasks.
arXiv Detail & Related papers (2025-06-02T17:27:10Z)
A Consensus Privacy Metrics Framework for Synthetic Data [13.972528788909813]
There is no consolidated standard for measuring privacy in synthetic data. Our findings indicate that current similarity metrics fail to measure identity disclosure. For differentially private synthetic data, a privacy budget other than close to zero was not considered interpretable.
arXiv Detail & Related papers (2025-03-06T21:19:02Z)
Synthetic Data Privacy Metrics [2.1213500139850017]
We review the pros and cons of popular metrics that include simulations of adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create.
arXiv Detail & Related papers (2025-01-07T17:02:33Z)
Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918]
2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition.
arXiv Detail & Related papers (2024-12-02T11:12:01Z)
Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains [9.123834467375532]
We explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in high-stakes domains. Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data.
arXiv Detail & Related papers (2024-10-10T19:31:02Z)
Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. RAG systems may face severe privacy risks when retrieving private data. We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
Practical considerations on using private sampling for synthetic data [1.3654846342364308]
Differential privacy for synthetic data generation has received much attention due to the ability of preserving privacy while freely using the synthetic data. Private sampling is the first noise-free method to construct differentially private synthetic data with rigorous bounds for privacy and accuracy. We provide an implementation of the private sampling algorithm and discuss the realism of its constraints in practical cases.
arXiv Detail & Related papers (2023-12-12T10:20:04Z)
The Inadequacy of Similarity-based Privacy Metrics: Privacy Attacks against "Truly Anonymous" Synthetic Datasets [12.730435519914415]
We examine the privacy metrics used in real-world synthetic data deployments and demonstrate their unreliability in several ways. We introduce ReconSyn, a reconstruction attack that generates multiple synthetic datasets that are considered private by the metrics but actually leak unique information to individual records. We show that ReconSyn recovers 78-100% of the outliers in the train data with only black-box access to a single fitted generative model and the privacy metrics.
arXiv Detail & Related papers (2023-12-08T15:42:28Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development [0.6906005491572401]
This paper investigates the policies governing the creation, utilization, and dissemination of synthetic data. A well crafted synthetic data policy must strike a balance between privacy concerns and the utility of data.
arXiv Detail & Related papers (2023-08-31T23:18:53Z)
Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process. We generate a representative as well as fair version of the UCI Adult census data set. We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
Really Useful Synthetic Data -- A Framework to Evaluate the Quality of Differentially Private Synthetic Data [2.538209532048867]
Recent advances in generating synthetic data that allow to add principled ways of protecting privacy are a crucial step in sharing statistical information in a privacy preserving way. To further optimise the inherent trade-off between data privacy and data quality, it is necessary to think closely about the latter. We develop a framework to evaluate the quality of differentially private synthetic data from an applied researcher's perspective.
arXiv Detail & Related papers (2020-04-16T16:24:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.