Related papers: Rethinking Anonymity Claims in Synthetic Data Generation: A Model-Centric Privacy Attack Perspective

Rethinking Anonymity Claims in Synthetic Data Generation: A Model-Centric Privacy Attack Perspective

URL: http://arxiv.org/abs/2601.22434v1
Date: Fri, 30 Jan 2026 00:57:41 GMT
Title: Rethinking Anonymity Claims in Synthetic Data Generation: A Model-Centric Privacy Attack Perspective
Authors: Georgi Ganev, Emiliano De Cristofaro,
Abstract summary: Training generative machine learning models to produce synthetic data has become a popular approach for enhancing privacy in data sharing.<n>As this typically involves processing sensitive personal information, releasing either the trained model or generated synthetic anonymity can still pose privacy risks.<n>We argue that meaningful assessments must account for the capabilities and properties of underlying generative model and be grounded in state-of-the-art privacy attacks.
Score: 18.404146545866812
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training generative machine learning models to produce synthetic tabular data has become a popular approach for enhancing privacy in data sharing. As this typically involves processing sensitive personal information, releasing either the trained model or generated synthetic datasets can still pose privacy risks. Yet, recent research, commercial deployments, and privacy regulations like the General Data Protection Regulation (GDPR) largely assess anonymity at the level of an individual dataset. In this paper, we rethink anonymity claims about synthetic data from a model-centric perspective and argue that meaningful assessments must account for the capabilities and properties of the underlying generative model and be grounded in state-of-the-art privacy attacks. This perspective better reflects real-world products and deployments, where trained models are often readily accessible for interaction or querying. We interpret the GDPR's definitions of personal data and anonymization under such access assumptions to identify the types of identifiability risks that must be mitigated and map them to privacy attacks across different threat settings. We then argue that synthetic data techniques alone do not ensure sufficient anonymization. Finally, we compare the two mechanisms most commonly used alongside synthetic data -- Differential Privacy (DP) and Similarity-based Privacy Metrics (SBPMs) -- and argue that while DP can offer robust protections against identifiability risks, SBPMs lack adequate safeguards. Overall, our work connects regulatory notions of identifiability with model-centric privacy attacks, enabling more responsible and trustworthy regulatory assessment of synthetic data systems by researchers, practitioners, and policymakers.

Related papers

Empirical Evaluation of Structured Synthetic Data Privacy Metrics: Novel experimental framework [34.56525983543448]
Synthetic data generation is gaining traction as a privacy enhancing technology.<n>The concept of data privacy remains elusive, making it challenging for practitioners to evaluate and benchmark the degree of privacy protection offered by synthetic data.
arXiv Detail & Related papers (2025-12-18T08:09:28Z)
How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy [52.00934156883483]
Differential Privacy (DP) is a framework for reasoning about and limiting information leakage.<n>Differentially Private Synthetic data refers to synthetic data that preserves the overall trends of source data.
arXiv Detail & Related papers (2025-12-02T21:14:39Z)
Synth-MIA: A Testbed for Auditing Privacy Leakage in Tabular Data Synthesis [8.4361320391543]
Tabular Generative Models are often argued to preserve privacy by creating synthetic datasets that resemble training data.<n>Membership Inference Attacks (MIAs) have recently emerged as a method for evaluating privacy leakage in synthetic data.<n>We propose a unified, model-agnostic threat framework that deploys a collection of attacks to estimate the maximum empirical privacy leakage in synthetic datasets.
arXiv Detail & Related papers (2025-09-22T16:53:38Z)
On the MIA Vulnerability Gap Between Private GANs and Diffusion Models [51.53790101362898]
Generative Adversarial Networks (GANs) and diffusion models have emerged as leading approaches for high-quality image synthesis.<n>We present the first unified theoretical and empirical analysis of the privacy risks faced by differentially private generative models.
arXiv Detail & Related papers (2025-09-03T14:18:22Z)
Privacy Auditing Synthetic Data Release through Local Likelihood Attacks [7.780592134085148]
Gene Likelihood Ratio Attack (Gen-LRA)<n>Gen-LRA formulates its attack by evaluating the influence a test observation has in a surrogate model's estimation of a local likelihood ratio over the synthetic data.<n>Results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data.
arXiv Detail & Related papers (2025-08-28T18:27:40Z)
RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting [17.294176570269]
We propose a reinforcement learning framework that fine-tunes a large language model (LLM) using a composite reward function.<n>The privacy reward combines semantic cues with structural patterns derived from a minimum spanning tree (MST) over latent representations.<n> Empirical results show that the proposed method significantly enhances author obfuscation and privacy metrics without degrading semantic quality.
arXiv Detail & Related papers (2025-08-25T04:38:19Z)
SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy [0.0]
We investigate capability of Large Language Models (Ms) to generate synthetic datasets with Differential Privacy (DP) mechanisms.<n>Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process.<n>We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data.
arXiv Detail & Related papers (2024-12-30T01:10:10Z)
Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.<n>RAG systems may face severe privacy risks when retrieving private data.<n>We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensures fidelity to the source data, assesses utility, robustness, and privacy preservation. We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution. We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data [1.5293427903448022]
We introduce a new attribute inference attack against synthetic data. We show that our attack can be highly accurate even on arbitrary records. We then evaluate the tradeoff between protecting privacy and preserving statistical utility.
arXiv Detail & Related papers (2023-01-24T14:56:36Z)
Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process. We generate a representative as well as fair version of the UCI Adult census data set. We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.