Related papers: On the Inadequacy of Similarity-based Privacy Metrics: Reconstruction Attacks against "Truly Anonymous Synthetic Data''

On the Inadequacy of Similarity-based Privacy Metrics: Reconstruction Attacks against "Truly Anonymous Synthetic Data''

URL: http://arxiv.org/abs/2312.05114v1
Date: Fri, 8 Dec 2023 15:42:28 GMT
Title: On the Inadequacy of Similarity-based Privacy Metrics: Reconstruction Attacks against "Truly Anonymous Synthetic Data''
Authors: Georgi Ganev and Emiliano De Cristofaro
Abstract summary: We review the privacy metrics offered by leading companies in this space and shed light on a few critical flaws in reasoning about privacy entirely via empirical evaluations. We present a reconstruction attack, ReconSyn, which successfully recovers (i.e., leaks all attributes of) at least 78% of the low-density train records (or outliers) with only black-box access to a single fitted generative model and the privacy metrics.
Score: 15.0393231456773
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training generative models to produce synthetic data is meant to provide a privacy-friendly approach to data release. However, we get robust guarantees only when models are trained to satisfy Differential Privacy (DP). Alas, this is not the standard in industry as many companies use ad-hoc strategies to empirically evaluate privacy based on the statistical similarity between synthetic and real data. In this paper, we review the privacy metrics offered by leading companies in this space and shed light on a few critical flaws in reasoning about privacy entirely via empirical evaluations. We analyze the undesirable properties of the most popular metrics and filters and demonstrate their unreliability and inconsistency through counter-examples. We then present a reconstruction attack, ReconSyn, which successfully recovers (i.e., leaks all attributes of) at least 78% of the low-density train records (or outliers) with only black-box access to a single fitted generative model and the privacy metrics. Finally, we show that applying DP only to the model or using low-utility generators does not mitigate ReconSyn as the privacy leakage predominantly comes from the metrics. Overall, our work serves as a warning to practitioners not to deviate from established privacy-preserving mechanisms.

Related papers

The DCR Delusion: Measuring the Privacy Risk of Synthetic Data [8.673204690445955]
Membership Inference Attacks (MIAs) are widely considered the gold standard for empirically assessing the privacy of a synthetic dataset.<n>These metrics estimate privacy by measuring the similarity between the training data and generated synthetic data.<n>In this work we show that, while computationally inexpensive, DCR and other distance-based metrics fail to identify privacy leakage.
arXiv Detail & Related papers (2025-05-02T18:21:14Z)
A Consensus Privacy Metrics Framework for Synthetic Data [13.972528788909813]
There is no consolidated standard for measuring privacy in synthetic data. Our findings indicate that current similarity metrics fail to measure identity disclosure. For differentially private synthetic data, a privacy budget other than close to zero was not considered interpretable.
arXiv Detail & Related papers (2025-03-06T21:19:02Z)
Synthetic Data Privacy Metrics [2.1213500139850017]
We review the pros and cons of popular metrics that include simulations of adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create.
arXiv Detail & Related papers (2025-01-07T17:02:33Z)
Defining 'Good': Evaluation Framework for Synthetic Smart Meter Data [14.779917834583577]
We show that standard privacy attack methods are inadequate for assessing privacy risks of smart meter datasets. We propose an improved method by injecting training data with implausible outliers, then launching privacy attacks directly on these outliers.
arXiv Detail & Related papers (2024-07-16T14:41:27Z)
Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing [9.061271587514215]
We propose a principled vulnerable record identification technique for synthetic data publishing. We show it to strongly outperform previous ad-hoc methods across datasets and generators. We show it to accurately identify vulnerable records when synthetic data generators are made differentially private.
arXiv Detail & Related papers (2023-06-17T09:42:46Z)
Differentially Private Synthetic Data Generation via Lipschitz-Regularised Variational Autoencoders [3.7463972693041274]
It is often overlooked that generative models are prone to memorising many details of individual training records. In this paper we explore an alternative approach for privately generating data that makes direct use of the inherentity in generative models.
arXiv Detail & Related papers (2023-04-22T07:24:56Z)
Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
How Do Input Attributes Impact the Privacy Loss in Differential Privacy? [55.492422758737575]
We study the connection between the per-subject norm in DP neural networks and individual privacy loss. We introduce a novel metric termed the Privacy Loss-Input Susceptibility (PLIS) which allows one to apportion the subject's privacy loss to their input attributes.
arXiv Detail & Related papers (2022-11-18T11:39:03Z)
Analyzing Privacy Leakage in Machine Learning via Multiple Hypothesis Testing: A Lesson From Fano [83.5933307263932]
We study data reconstruction attacks for discrete data and analyze it under the framework of hypothesis testing. We show that if the underlying private data takes values from a set of size $M$, then the target privacy parameter $epsilon$ can be $O(log M)$ before the adversary gains significant inferential power.
arXiv Detail & Related papers (2022-10-24T23:50:12Z)
No Free Lunch in "Privacy for Free: How does Dataset Condensation Help Privacy" [75.98836424725437]
New methods designed to preserve data privacy require careful scrutiny. Failure to preserve privacy is hard to detect, and yet can lead to catastrophic results when a system implementing a privacy-preserving'' method is attacked.
arXiv Detail & Related papers (2022-09-29T17:50:23Z)
Smooth Anonymity for Sparse Graphs [69.1048938123063]
differential privacy has emerged as the gold standard of privacy, however, when it comes to sharing sparse datasets. In this work, we consider a variation of $k$-anonymity, which we call smooth-$k$-anonymity, and design simple large-scale algorithms that efficiently provide smooth-$k$-anonymity.
arXiv Detail & Related papers (2022-07-13T17:09:25Z)
Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent [69.14164921515949]
We characterize privacy guarantees for individual examples when releasing models trained by DP-SGD. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees.
arXiv Detail & Related papers (2022-06-06T13:49:37Z)
Synthetic Data -- Anonymisation Groundhog Day [4.694549066382216]
We present the first quantitative evaluation of the privacy gain of synthetic data publishing. We show that synthetic data does not prevent inference attacks or does not retain data utility. In contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict.
arXiv Detail & Related papers (2020-11-13T16:58:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.