Related papers: A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data

URL: http://arxiv.org/abs/2301.10053v3
Date: Thu, 9 May 2024 10:35:25 GMT
Title: A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data
Authors: Meenatchi Sundaram Muthu Selva Annamalai, Andrea Gadotti, Luc Rocher
Abstract summary: We introduce a new attribute inference attack against synthetic data. We show that our attack can be highly accurate even on arbitrary records. We then evaluate the tradeoff between protecting privacy and preserving statistical utility.
Score: 1.5293427903448022
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in synthetic data generation (SDG) have been hailed as a solution to the difficult problem of sharing sensitive data while protecting privacy. SDG aims to learn statistical properties of real data in order to generate "artificial" data that are structurally and statistically similar to sensitive data. However, prior research suggests that inference attacks on synthetic data can undermine privacy, but only for specific outlier records. In this work, we introduce a new attribute inference attack against synthetic data. The attack is based on linear reconstruction methods for aggregate statistics, which target all records in the dataset, not only outliers. We evaluate our attack on state-of-the-art SDG algorithms, including Probabilistic Graphical Models, Generative Adversarial Networks, and recent differentially private SDG mechanisms. By defining a formal privacy game, we show that our attack can be highly accurate even on arbitrary records, and that this is the result of individual information leakage (as opposed to population-level inference). We then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility. Our findings suggest that current SDG methods cannot consistently provide sufficient privacy protection against inference attacks while retaining reasonable utility. The best method evaluated, a differentially private SDG mechanism, can provide both protection against inference attacks and reasonable utility, but only in very specific settings. Lastly, we show that releasing a larger number of synthetic records can improve utility but at the cost of making attacks far more effective.
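
The abstract describes the attack only as "based on linear reconstruction methods for aggregate statistics". As a rough illustration of that general idea, and not the authors' actual procedure, the sketch below recovers a secret binary column from noisy subset-sum answers with ordinary least squares. Everything in it is an assumption made for illustration: the function names (linear_reconstruction, simulate), the random subset queries, the Gaussian noise, and the sizes. In the paper's setting the aggregate answers would presumably be computed from the released synthetic data rather than simulated.

```python
import numpy as np


def linear_reconstruction(queries, answers):
    """Estimate a secret binary column from noisy subset-sum answers.

    queries: (m, n) 0/1 matrix; row i selects the records counted by query i.
    answers: (m,) noisy counts, approximately queries @ secret.
    Both names are hypothetical and chosen for this sketch only.
    """
    # Relax the secret to real values and solve min ||queries @ s - answers||_2.
    estimate, *_ = np.linalg.lstsq(queries, answers, rcond=None)
    # Round the relaxed solution back to {0, 1}.
    return (estimate > 0.5).astype(int)


def simulate(n_records=50, n_queries=400, noise_scale=1.0, seed=0):
    """Toy experiment: random subset queries with Gaussian noise (assumed setup)."""
    rng = np.random.default_rng(seed)
    secret = rng.integers(0, 2, size=n_records)
    queries = rng.integers(0, 2, size=(n_queries, n_records)).astype(float)
    noise = rng.normal(0.0, noise_scale, size=n_queries)
    answers = queries @ secret + noise
    guess = linear_reconstruction(queries, answers)
    # Fraction of secret bits the attacker recovers.
    return float(np.mean(guess == secret))


accuracy = simulate()
print(f"fraction of secret bits recovered: {accuracy:.2f}")
```

Increasing noise_scale or reducing n_queries in this toy setup lowers the recovered fraction, which gives a simple way to see why the amount and accuracy of released statistics matter for this kind of attack.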

Related papers

Quality Degradation Attack in Synthetic Data [5.461072909384133]
This study investigates quality attacks initiated by adversaries who possess access to the real dataset or control over the generation process. We formalize a corresponding threat model and empirically evaluate the effectiveness of targeted manipulations of real data.
arXiv Detail & Related papers (2026-01-06T11:43:31Z)
How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy [52.00934156883483]
Differential Privacy (DP) is a framework for reasoning about and limiting information leakage. Differentially Private Synthetic data refers to synthetic data that preserves the overall trends of source data.
arXiv Detail & Related papers (2025-12-02T21:14:39Z)
Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler [67.24175911858312]
Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Bayesian Data Scheduler (BDS) is an adaptive tuning-stage defense strategy with no need for attack simulation. BDS learns the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets.
arXiv Detail & Related papers (2025-10-31T04:49:37Z)
SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling [12.706081519342883]
We conduct the first systematic study of privacy leakage in the Synthetic Minority Over-sampling Technique (SMOTE). We build two novel attacks with very limited assumptions: DistinSMOTE, which perfectly distinguishes real from synthetic records in augmented datasets, and ReconSMOTE, which reconstructs real minority records from synthetic datasets with perfect precision. Our work reveals that SMOTE is inherently non-private and disproportionately exposes minority records, highlighting the need to reconsider its use in privacy-sensitive applications.
arXiv Detail & Related papers (2025-10-16T18:55:46Z)
On the MIA Vulnerability Gap Between Private GANs and Diffusion Models [51.53790101362898]
Generative Adversarial Networks (GANs) and diffusion models have emerged as leading approaches for high-quality image synthesis. We present the first unified theoretical and empirical analysis of the privacy risks faced by differentially private generative models.
arXiv Detail & Related papers (2025-09-03T14:18:22Z)
Privacy Auditing Synthetic Data Release through Local Likelihood Attacks [7.780592134085148]
Generative Likelihood Ratio Attack (Gen-LRA). Gen-LRA formulates its attack by evaluating the influence a test observation has on a surrogate model's estimation of a local likelihood ratio over the synthetic data. Results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data.
arXiv Detail & Related papers (2025-08-28T18:27:40Z)
Improving Noise Efficiency in Privacy-preserving Dataset Distillation [59.57846442477106]
We introduce a novel framework that decouples sampling from optimization for better convergence and improves signal quality. On CIFAR-10, our method achieves a 10.0% improvement with 50 images per class and an 8.3% increase with just one-fifth the distilled set size of previous state-of-the-art methods.
arXiv Detail & Related papers (2025-08-03T13:15:52Z)
Benchmarking Fraud Detectors on Private Graph Data [70.4654745317714]
Currently, many types of fraud are managed in part by automated detection algorithms that operate over graphs. We consider the scenario where a data holder wishes to outsource development of fraud detectors to third parties. Third parties submit their fraud detectors to the data holder, who evaluates these algorithms on a private dataset and then publicly communicates the results. We propose a realistic privacy attack on this system that allows an adversary to de-anonymize individuals' data based only on the evaluation results.
arXiv Detail & Related papers (2025-07-30T03:20:15Z)
Contrastive Learning-Based privacy metrics in Tabular Synthetic Datasets [40.67424997797513]
Synthetic data has garnered attention as a Privacy Enhancing Technology (PET) in sectors such as healthcare and finance. Similarity-based methods aim at finding the level of similarity between training and synthetic data. Attack-based methods conduct deliberate attacks on synthetic datasets.
arXiv Detail & Related papers (2025-02-19T15:52:23Z)
Differentially Private Random Feature Model [52.468511541184895]
We produce a differentially private random feature model for privacy-preserving kernel machines. We show that our method preserves privacy and derive a generalization error bound for the method.
arXiv Detail & Related papers (2024-12-06T05:31:08Z)
Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning [59.29849532966454]
We propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data in a privacy-preserving manner. Our method achieves over 20% improvements in forgetting error compared to the state-of-the-art.
arXiv Detail & Related papers (2024-11-04T21:27:06Z)
The Inadequacy of Similarity-based Privacy Metrics: Privacy Attacks against "Truly Anonymous" Synthetic Datasets [12.730435519914415]
We examine the privacy metrics used in real-world synthetic data deployments and demonstrate their unreliability in several ways. We introduce ReconSyn, a reconstruction attack that generates multiple synthetic datasets that are considered private by the metrics but actually leak information unique to individual records. We show that ReconSyn recovers 78-100% of the outliers in the train data with only black-box access to a single fitted generative model and the privacy metrics.
arXiv Detail & Related papers (2023-12-08T15:42:28Z)
Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning [3.29354893777827]
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning. We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
arXiv Detail & Related papers (2023-07-04T18:37:11Z)
Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing [9.061271587514215]
We propose a principled vulnerable record identification technique for synthetic data publishing. We show it to strongly outperform previous ad-hoc methods across datasets and generators. We show it to accurately identify vulnerable records when synthetic data generators are made differentially private.
arXiv Detail & Related papers (2023-06-17T09:42:46Z)
Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
No Free Lunch in "Privacy for Free: How does Dataset Condensation Help Privacy" [75.98836424725437]
New methods designed to preserve data privacy require careful scrutiny. Failure to preserve privacy is hard to detect, and yet can lead to catastrophic results when a system implementing a "privacy-preserving" method is attacked.
arXiv Detail & Related papers (2022-09-29T17:50:23Z)
Model Inversion Attacks against Graph Neural Networks [65.35955643325038]
We study model inversion attacks against Graph Neural Networks (GNNs). In this paper, we present GraphMI to infer the private training graph data. Our experimental results show that such defenses are not sufficiently effective and call for more advanced defenses against privacy attacks.
arXiv Detail & Related papers (2022-09-16T09:13:43Z)
Autoregressive Perturbations for Data Poisoning [54.205200221427994]
Data scraping from social media has led to growing concerns regarding unauthorized use of data. Data poisoning attacks have been proposed as a bulwark against scraping. We introduce autoregressive (AR) poisoning, a method that can generate poisoned data without access to the broader dataset.
arXiv Detail & Related papers (2022-06-08T06:24:51Z)
Curse or Redemption? How Data Heterogeneity Affects the Robustness of Federated Learning [51.15273664903583]
Data heterogeneity has been identified as one of the key features in federated learning but is often overlooked through the lens of robustness to adversarial attacks. This paper focuses on characterizing and understanding its impact on backdooring attacks in federated learning through comprehensive experiments using synthetic and the LEAF benchmarks.
arXiv Detail & Related papers (2021-02-01T06:06:21Z)
privGAN: Protecting GANs from membership inference attacks at low cost [5.735035463793008]
Generative Adversarial Networks (GANs) have made the release of synthetic images a viable approach to sharing data without releasing the original dataset. Recent work has shown that GAN models and their synthetically generated data can be used by an adversary to infer training set membership. Here we develop a new GAN architecture (privGAN) where the generator is trained not only to cheat the discriminator but also to defend against membership inference attacks.
arXiv Detail & Related papers (2019-12-31T20:47:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.