Related papers: SoK: Privacy-Preserving Data Synthesis

SoK: Privacy-Preserving Data Synthesis

URL: http://arxiv.org/abs/2307.02106v2
Date: Sat, 5 Aug 2023 06:28:12 GMT
Title: SoK: Privacy-Preserving Data Synthesis
Authors: Yuzheng Hu, Fan Wu, Qinbin Li, Yunhui Long, Gonzalo Munilla Garrido, Chang Ge, Bolin Ding, David Forsyth, Bo Li, Dawn Song
Abstract summary: This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field. We put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods.
Score: 72.92263073534899
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As the prevalence of data analysis grows, safeguarding data privacy has become a paramount concern. Consequently, there has been an upsurge in the development of mechanisms aimed at privacy-preserving data analyses. However, these approaches are task-specific; designing algorithms for new tasks is a cumbersome process. As an alternative, one can create synthetic data that is (ideally) devoid of private information. This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field. Specifically, we put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods. Under the master recipe, we further dissect the statistical methods into choices of modeling and representation, and investigate the DL-based methods by different generative modeling principles. To consolidate our findings, we provide comprehensive reference tables, distill key takeaways, and identify open problems in the existing literature. In doing so, we aim to answer the following questions: What are the design principles behind different PPDS methods? How can we categorize these methods, and what are the advantages and disadvantages associated with each category? Can we provide guidelines for method selection in different real-world scenarios? We proceed to benchmark several prominent DL-based methods on the task of private image synthesis and conclude that DP-MERF is an all-purpose approach. Finally, upon systematizing the work over the past decade, we identify future directions and call for actions from researchers.

Related papers

SoK: Can Synthetic Images Replace Real Data? A Survey of Utility and Privacy of Synthetic Image Generation [5.2438534296318355]
This study seeks to answer questions in PPDS: Can synthetic data effectively replace real data?<n>With a systematic evaluation of diverse methods, our study provides actionable insights into the utility-privacy tradeoffs of synthetic data generation methods.
arXiv Detail & Related papers (2025-06-24T06:41:34Z)
Differential Privacy in Machine Learning: From Symbolic AI to LLMs [49.1574468325115]
Differential privacy provides a formal framework to mitigate privacy risks.<n>It ensures that the inclusion or exclusion of any single data point does not significantly alter the output of an algorithm.
arXiv Detail & Related papers (2025-06-13T11:30:35Z)
Benchmarking Differentially Private Tabular Data Synthesis [21.320681813245525]
We present a benchmark for evaluating differentDP data synthesis methods. Our evaluation reveals that a significant utility-efficiency trade-off exists among current state-of-the-art methods. We conduct an in-depth analysis of each module with experimental validation, offering theoretical insights into the strengths and limitations of different strategies.
arXiv Detail & Related papers (2025-04-18T20:27:23Z)
Differentially Private Random Feature Model [52.468511541184895]
We produce a differentially private random feature model for privacy-preserving kernel machines. We show that our method preserves privacy and derive a generalization error bound for the method.
arXiv Detail & Related papers (2024-12-06T05:31:08Z)
Data Lineage Inference: Uncovering Privacy Vulnerabilities of Dataset Pruning [31.888075470799908]
We show that even if data in a redundant set is solely used before model training, its pruning-phase membership status can still be detected through attacks. We introduce a new task called Data-Centric Membership Inference and propose the first ever data-centric privacy inference paradigm named Data Lineage Inference. We find that different pruning methods involve varying levels of privacy leakage, and even the same pruning method can present different privacy risks at different pruning fractions.
arXiv Detail & Related papers (2024-11-24T11:46:59Z)
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models [33.488331159912136]
Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference. Data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. We present a comprehensive review on existing literature of data assessment and selection especially for instruction tuning of LLMs.
arXiv Detail & Related papers (2024-08-04T16:50:07Z)
Federated Causal Discovery from Heterogeneous Data [70.31070224690399]
We propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data. These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy. We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.
arXiv Detail & Related papers (2024-02-20T18:53:53Z)
Continual Learning with Pre-Trained Models: A Survey [61.97613090666247]
Continual Learning aims to overcome the catastrophic forgetting of former knowledge when learning new ones. This paper presents a comprehensive survey of the latest advancements in PTM-based CL.
arXiv Detail & Related papers (2024-01-29T18:27:52Z)
A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibited data access and data sharing. Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data. Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
Methods for generating and evaluating synthetic longitudinal patient data: a systematic review [0.0]
This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data. The review adheres to the PRISMA guidelines and covers literature from five databases until the end of 2022. The paper describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods.
arXiv Detail & Related papers (2023-09-21T12:44:31Z)
Differentially Private Linear Regression with Linked Data [3.9325957466009203]
Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses on developing differentially private versions of individual statistical and machine learning tasks. We present two differentially private algorithms for linear regression with linked data.
arXiv Detail & Related papers (2023-08-01T21:00:19Z)
Going beyond research datasets: Novel intent discovery in the industry setting [60.90117614762879]
This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
arXiv Detail & Related papers (2023-05-09T14:21:29Z)
Federated Offline Reinforcement Learning [55.326673977320574]
We propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites. We design the first federated policy optimization algorithm for offline RL with sample complexity. We give a theoretical guarantee for the proposed algorithm, where the suboptimality for the learned policies is comparable to the rate as if data is not distributed.
arXiv Detail & Related papers (2022-06-11T18:03:26Z)
Privacy preserving n-party scalar product protocol [0.0]
Privacy-preserving machine learning enables the training of models on decentralized datasets without the need to reveal the data. The privacy preserving scalar product protocol, which enables the dot product of vectors without revealing them, is one popular example for its versatility. We propose a generalization of the protocol for an arbitrary number of parties, based on an existing two-party method.
arXiv Detail & Related papers (2021-12-17T11:14:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.