Quality Degradation Attack in Synthetic Data
- URL: http://arxiv.org/abs/2601.02947v1
- Date: Tue, 06 Jan 2026 11:43:31 GMT
- Title: Quality Degradation Attack in Synthetic Data
- Authors: Qinyi Liu, Dong Liu, Farhad Vadiee, Mohammad Khalil, Pedro P. Vergara Barrios
- Abstract summary: This study investigates quality degradation attacks initiated by adversaries who possess access to the real dataset or control over the generation process. We formalize a corresponding threat model and empirically evaluate the effectiveness of targeted manipulations of real data.
- Score: 5.461072909384133
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Synthetic Data Generation (SDG) can be used to facilitate privacy-preserving data sharing. However, most existing research focuses on privacy attacks where the adversary is the recipient of the released synthetic data and attempts to infer sensitive information from it. This study investigates quality degradation attacks initiated by adversaries who possess access to the real dataset or control over the generation process, such as the data owner, the synthetic data provider, or potential intruders. We formalize a corresponding threat model and empirically evaluate the effectiveness of targeted manipulations of real data (e.g., label flipping and feature-importance-based interventions) on the quality of generated synthetic data. The results show that even small perturbations can substantially reduce downstream predictive performance and increase statistical divergence, exposing vulnerabilities within SDG pipelines. This study highlights the need to integrate integrity verification and robustness mechanisms, alongside privacy protection, to ensure the reliability and trustworthiness of synthetic data sharing frameworks.
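As an illustration of how small such a manipulation can be, below is a minimal Python sketch of the label-flipping intervention the abstract describes. It assumes a binary 0/1 label column; the function name, flip fraction, and seeding are illustrative assumptions, not the paper's protocol.
```python
# Label-flipping sketch: corrupt a small fraction of binary labels in the
# real data before it reaches the synthetic data generator.
# flip_fraction, seed, and the 0/1 label encoding are illustrative assumptions.
import numpy as np
import pandas as pd

def flip_labels(real_df: pd.DataFrame, label_col: str,
                flip_fraction: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Return a copy of real_df with flip_fraction of its 0/1 labels inverted."""
    rng = np.random.default_rng(seed)
    poisoned = real_df.copy()
    n_flip = int(flip_fraction * len(poisoned))
    idx = rng.choice(poisoned.index, size=n_flip, replace=False)
    poisoned.loc[idx, label_col] = 1 - poisoned.loc[idx, label_col]
    return poisoned
```
The poisoned frame would then replace the clean data when fitting the generator, so every downstream synthetic sample inherits the corrupted label distribution.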
Related papers
- Rethinking Anonymity Claims in Synthetic Data Generation: A Model-Centric Privacy Attack Perspective [18.404146545866812]
Training generative machine learning models to produce synthetic data has become a popular approach for enhancing privacy in data sharing. As this typically involves processing sensitive personal information, releasing either the trained model or the generated synthetic data can still pose privacy risks. We argue that meaningful assessments must account for the capabilities and properties of the underlying generative model and be grounded in state-of-the-art privacy attacks.
arXiv Detail & Related papers (2026-01-30T00:57:41Z)
- How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy [52.00934156883483]
Differential Privacy (DP) is a framework for reasoning about and limiting information leakage. Differentially private synthetic data refers to synthetic data that preserves the overall trends of the source data.
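For context, here is a minimal sketch of the Laplace mechanism, the textbook DP building block the snippet alludes to. This is a generic illustration, not one of the guide's recipes, and all names are assumptions.
```python
# Textbook Laplace mechanism for an epsilon-DP count query (sensitivity 1).
# Generic illustration of the DP framework; not a recipe from the cited guide.
import numpy as np

def dp_count(n_records: int, epsilon: float, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    sensitivity = 1.0  # adding/removing one record changes a count by at most 1
    return n_records + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
```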
arXiv Detail & Related papers (2025-12-02T21:14:39Z)
- Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al. We critically examine where exactly synthetic data improves model generalization. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
- On the MIA Vulnerability Gap Between Private GANs and Diffusion Models [51.53790101362898]
Generative Adversarial Networks (GANs) and diffusion models have emerged as leading approaches for high-quality image synthesis. We present the first unified theoretical and empirical analysis of the privacy risks faced by differentially private generative models.
arXiv Detail & Related papers (2025-09-03T14:18:22Z)
- Privacy Auditing Synthetic Data Release through Local Likelihood Attacks [7.780592134085148]
We introduce the Generative Likelihood Ratio Attack (Gen-LRA). Gen-LRA formulates its attack by evaluating the influence a test observation has on a surrogate model's estimate of a local likelihood ratio over the synthetic data. Results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data.
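A rough sketch of what such a local likelihood-ratio membership score could look like, using kernel density estimates as surrogate density models. This is a reconstruction from the snippet's description, not the paper's implementation; the KDE choice, bandwidth, and function names are assumptions.
```python
# Hypothetical local likelihood-ratio membership score in the spirit of
# Gen-LRA: compare the log density the synthetic data implies for a test
# record against a reference density. KDE surrogates, bandwidth, and names
# are assumptions, not the paper's implementation.
import numpy as np
from sklearn.neighbors import KernelDensity

def lr_membership_score(x_test: np.ndarray, synthetic: np.ndarray,
                        reference: np.ndarray, bandwidth: float = 0.5) -> float:
    """Higher score: x_test is more typical of the synthetic data than of the
    reference population, suggesting it influenced generation."""
    kde_syn = KernelDensity(bandwidth=bandwidth).fit(synthetic)
    kde_ref = KernelDensity(bandwidth=bandwidth).fit(reference)
    x = x_test.reshape(1, -1)
    return float(kde_syn.score_samples(x)[0] - kde_ref.score_samples(x)[0])
```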
arXiv Detail & Related papers (2025-08-28T18:27:40Z)
- SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data [13.699107354397286]
We show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but also maintains utility in downstream learning tasks.
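For intuition, here is a minimal sketch of the two ingredients the snippet names: SMOTE-style interpolation between nearest neighbors, plus additive noise. The noise calibration required for a formal DP guarantee is the paper's contribution and is not reproduced here; all parameters are illustrative.
```python
# SMOTE-style interpolation plus additive Gaussian noise, the two ingredients
# the snippet names. The calibrated noise needed for a formal DP guarantee is
# the paper's contribution and is NOT reproduced here.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_with_noise(X: np.ndarray, n_samples: int, k: int = 5,
                     noise_scale: float = 0.1, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neighbors = nn.kneighbors(X)  # column 0 is each point itself
    samples = []
    for _ in range(n_samples):
        i = rng.integers(len(X))                  # random base point
        j = neighbors[i, rng.integers(1, k + 1)]  # one of its k nearest neighbors
        lam = rng.random()
        point = X[i] + lam * (X[j] - X[i])        # SMOTE interpolation
        samples.append(point + rng.normal(0.0, noise_scale, size=X.shape[1]))
    return np.asarray(samples)
```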
arXiv Detail & Related papers (2025-06-02T17:27:10Z)
- Quantitative Auditing of AI Fairness with Differentially Private Synthetic Data [0.30693357740321775]
Fairness auditing of AI systems can identify and quantify biases. Traditional auditing using real-world data raises security and privacy concerns. We propose a framework that leverages differentially private synthetic data to audit the fairness of AI systems.
arXiv Detail & Related papers (2025-04-30T13:36:27Z)
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. RAG systems may face severe privacy risks when retrieving private data. We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Instance-Level Safety-Aware Fidelity of Synthetic Data and Its Calibration [5.089356301032639]
We focus on the role of synthetic data fidelity in safety-critical applications, introducing four types of instance-level fidelity.
The aim is to ensure that applying testing on synthetic data can reveal real-world safety issues.
arXiv Detail & Related papers (2024-02-10T19:45:40Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to comprehensively assess the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development [0.6906005491572401]
This paper investigates the policies governing the creation, utilization, and dissemination of synthetic data.
A well-crafted synthetic data policy must strike a balance between privacy concerns and the utility of data.
arXiv Detail & Related papers (2023-08-31T23:18:53Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)