Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead
- URL: http://arxiv.org/abs/2503.20846v1
- Date: Wed, 26 Mar 2025 16:06:33 GMT
- Title: Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead
- Authors: Viktor Schlegel, Anil A Bharath, Zilong Zhao, Kevin Yee,
- Abstract summary: Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains. This survey presents the theoretical foundations of generative models and differential privacy, followed by a review of state-of-the-art methods. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks.
- Score: 7.410975558116122
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for downstream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general-domain benchmarks and performance on domain-specific data. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields so that this technology can deliver on its considerable potential.
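For readers new to the area, the formal guarantee referenced above (e.g., $\epsilon \leq 4$) is standard $(\epsilon, \delta)$-differential privacy: a randomized mechanism $M$ satisfies it if, for every pair of datasets $D, D'$ differing in a single record and every measurable set of outputs $S$,
$$\Pr[M(D) \in S] \;\le\; e^{\epsilon} \Pr[M(D') \in S] + \delta.$$
Smaller $\epsilon$ (and $\delta$) means stronger privacy, which is why utility degrades as the privacy budget tightens.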
Related papers
- Quantitative Auditing of AI Fairness with Differentially Private Synthetic Data [0.30693357740321775]
Fairness auditing of AI systems can identify and quantify biases.
Traditional auditing using real-world data raises security and privacy concerns.
We propose a framework that leverages differentially private synthetic data to audit the fairness of AI systems.
arXiv Detail & Related papers (2025-04-30T13:36:27Z)
- Differentially Private Federated Learning of Diffusion Models for Synthetic Tabular Data Generation [5.182014186927255]
We introduce the DP-Fed-FinDiff framework, a novel integration of Differential Privacy, Federated Learning, and Denoising Diffusion Probabilistic Models. We demonstrate the effectiveness of DP-Fed-FinDiff on multiple real-world financial datasets. The results affirm the potential of DP-Fed-FinDiff to enable secure data sharing and robust analytics in highly regulated domains. (A minimal sketch of the differentially private training loop that methods like this build on appears after this list.)
arXiv Detail & Related papers (2024-12-20T17:30:58Z)
- Synthetic Data: Revisiting the Privacy-Utility Trade-off [4.832355454351479]
An article stated that synthetic data does not provide a better trade-off between privacy and utility than traditional anonymization techniques. We analyzed the implementation of the privacy game described in the article and found that it operated in a highly specialized and constrained environment.
arXiv Detail & Related papers (2024-07-09T14:48:43Z)
- An applied Perspective: Estimating the Differential Identifiability Risk of an Exemplary SOEP Data Set [2.66269503676104]
We show how to compute the risk metric efficiently for a set of basic statistical queries.
Our empirical analysis, based on an extensive real-world scientific data set, extends what is known about computing such risks under realistic conditions. (A sketch of answering a basic statistical query under differential privacy appears after this list.)
arXiv Detail & Related papers (2024-07-04T17:50:55Z)
- Collection, usage and privacy of mobility data in the enterprise and public administrations [55.2480439325792]
Security measures such as anonymization are needed to protect individuals' privacy.
Within our study, we conducted expert interviews to gain insights into practices in the field.
We survey privacy-enhancing methods in use, which generally do not comply with state-of-the-art standards of differential privacy.
arXiv Detail & Related papers (2024-07-04T08:29:27Z)
- The Data Minimization Principle in Machine Learning [61.17813282782266]
Data minimization aims to reduce the amount of data collected, processed or retained.
It has been endorsed by various global data protection regulations.
However, its practical implementation remains a challenge due to the lack of a rigorous formulation.
arXiv Detail & Related papers (2024-05-29T19:40:27Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- A Summary of Privacy-Preserving Data Publishing in the Local Setting [0.6749750044497732]
Statistical Disclosure Control aims to minimize the risk of exposing confidential information by de-identifying it.
We outline the current privacy-preserving techniques employed in microdata de-identification, delve into privacy measures tailored for various disclosure scenarios, and assess metrics for information loss and predictive performance.
arXiv Detail & Related papers (2023-12-19T04:23:23Z)
- A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns is subject to stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- Breaking the Communication-Privacy-Accuracy Tradeoff with $f$-Differential Privacy [51.11280118806893]
We consider a federated data analytics problem in which a server coordinates the collaborative data analysis of multiple users with privacy concerns and limited communication capability.
We study the local differential privacy guarantees of discrete-valued mechanisms with finite output space through the lens of $f$-differential privacy (DP).
More specifically, we advance the existing literature by deriving tight $f$-DP guarantees for a variety of discrete-valued mechanisms. (Randomized response, the canonical discrete-valued local mechanism, is sketched after this list.)
arXiv Detail & Related papers (2023-02-19T16:58:53Z)
- Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle with the utility of their synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
- PEARL: Data Synthesis via Private Embeddings and Adversarial Reconstruction Learning [1.8692254863855962]
We propose a new framework for synthesizing data using deep generative models in a differentially private manner.
Within our framework, sensitive data are sanitized with rigorous privacy guarantees in a one-shot fashion.
Our proposal has theoretical guarantees of performance, and empirical evaluations on multiple datasets show that our approach outperforms other methods at reasonable levels of privacy.
arXiv Detail & Related papers (2021-06-08T18:00:01Z)
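The DP-Fed-FinDiff entry above, like most differentially private generative modeling work, builds on DP-SGD: per-sample gradient clipping followed by calibrated Gaussian noise. Below is a minimal sketch using PyTorch and the Opacus library; the toy model, random data, and hyperparameters are illustrative placeholders, not the configuration of any paper listed here.

```python
# Illustrative DP-SGD training sketch (assumed setup, not DP-Fed-FinDiff itself).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy reconstruction model standing in for a real generative network.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
features = torch.randn(512, 16)                     # placeholder "sensitive" data
loader = DataLoader(TensorDataset(features), batch_size=64)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.1,   # std of Gaussian noise relative to the clipping bound
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

criterion = nn.MSELoss()
for epoch in range(3):
    for (x,) in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), x)   # reconstruct the input
        loss.backward()                 # Opacus clips and noises per-sample grads
        optimizer.step()

# Privacy accounting: epsilon spent so far at a fixed delta.
print(f"epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f} at delta = 1e-5")
```

Note that the reported $\epsilon$ covers only the training mechanism itself; the survey's point about unaccounted privacy leakage (e.g., from hyperparameter tuning on real data) still applies.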
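The SOEP risk-estimation entry above concerns basic statistical queries. Under central differential privacy, the textbook way to release such a query is the Laplace mechanism; this is a generic sketch for a counting query, not that paper's identifiability-risk computation.

```python
import numpy as np

def private_count(records, predicate, epsilon, rng=None):
    """Release a counting query under epsilon-DP via the Laplace mechanism.

    A count has L1 sensitivity 1 (adding or removing one record changes it
    by at most 1), so the noise scale is 1 / epsilon.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative usage with made-up incomes: count values above 50,000.
incomes = [23_000, 41_500, 58_000, 77_250, 102_000]
print(f"noisy count: {private_count(incomes, lambda x: x > 50_000, epsilon=1.0):.1f}")
```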
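Finally, the $f$-DP entry above analyzes discrete-valued mechanisms under local differential privacy. The canonical example is randomized response, sketched here for a single binary attribute; the paper's contribution is a tighter $f$-DP analysis of such mechanisms, which this sketch does not reproduce.

```python
import math
import random

def randomized_response(bit, epsilon):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it.

    Each report satisfies epsilon-local-DP on its own, before aggregation:
    the likelihood ratio between the two possible inputs is exactly e^eps.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < p_truth else not bit

def estimate_rate(reports, epsilon):
    """Debias the observed fraction of 1s into an unbiased estimate."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

# Illustrative usage: 10,000 users, 30% true rate, epsilon = 1.0.
truths = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t, 1.0) for t in truths]
print(f"estimated rate: {estimate_rate(reports, 1.0):.3f}")
```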