SoK: Privacy-Preserving Data Synthesis
- URL: http://arxiv.org/abs/2307.02106v2
- Date: Sat, 5 Aug 2023 06:28:12 GMT
- Title: SoK: Privacy-Preserving Data Synthesis
- Authors: Yuzheng Hu, Fan Wu, Qinbin Li, Yunhui Long, Gonzalo Munilla Garrido, Chang Ge, Bolin Ding, David Forsyth, Bo Li, Dawn Song
- Abstract summary: This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field.
We put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods.
- Score: 72.92263073534899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the prevalence of data analysis grows, safeguarding data privacy has
become a paramount concern. Consequently, there has been an upsurge in the
development of mechanisms aimed at privacy-preserving data analyses. However,
these approaches are task-specific; designing algorithms for new tasks is a
cumbersome process. As an alternative, one can create synthetic data that is
(ideally) devoid of private information. This paper focuses on
privacy-preserving data synthesis (PPDS) by providing a comprehensive overview,
analysis, and discussion of the field. Specifically, we put forth a master
recipe that unifies two prominent strands of research in PPDS: statistical
methods and deep learning (DL)-based methods. Under the master recipe, we
further dissect the statistical methods into choices of modeling and
representation, and investigate the DL-based methods by different generative
modeling principles. To consolidate our findings, we provide comprehensive
reference tables, distill key takeaways, and identify open problems in the
existing literature. In doing so, we aim to answer the following questions:
What are the design principles behind different PPDS methods? How can we
categorize these methods, and what are the advantages and disadvantages
associated with each category? Can we provide guidelines for method selection
in different real-world scenarios? We proceed to benchmark several prominent
DL-based methods on the task of private image synthesis and conclude that
DP-MERF is an all-purpose approach. Finally, upon systematizing the work over
the past decade, we identify future directions and call for actions from
researchers.
Related papers
- Data Lineage Inference: Uncovering Privacy Vulnerabilities of Dataset Pruning [31.888075470799908]
We show that even if data in a redundant set is solely used before model training, its pruning-phase membership status can still be detected through attacks.
We introduce a new task called Data-Centric Membership Inference and propose the first ever data-centric privacy inference paradigm named Data Lineage Inference.
We find that different pruning methods involve varying levels of privacy leakage, and even the same pruning method can present different privacy risks at different pruning fractions.
arXiv Detail & Related papers (2024-11-24T11:46:59Z)
- Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models [33.488331159912136]
Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference.
Data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning.
We present a comprehensive review on existing literature of data assessment and selection especially for instruction tuning of LLMs.
arXiv Detail & Related papers (2024-08-04T16:50:07Z)
- Federated Causal Discovery from Heterogeneous Data [70.31070224690399]
We propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data.
These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy.
We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.
arXiv Detail & Related papers (2024-02-20T18:53:53Z)
- Continual Learning with Pre-Trained Models: A Survey [61.97613090666247]
Continual Learning aims to overcome the catastrophic forgetting of former knowledge when learning new knowledge.
This paper presents a comprehensive survey of the latest advancements in PTM-based CL.
arXiv Detail & Related papers (2024-01-29T18:27:52Z)
- A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
- Methods for generating and evaluating synthetic longitudinal patient data: a systematic review [0.0]
This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data.
The review adheres to the PRISMA guidelines and covers literature from five databases until the end of 2022.
The paper describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods.
arXiv Detail & Related papers (2023-09-21T12:44:31Z)
- Differentially Private Linear Regression with Linked Data [3.9325957466009203]
Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees.
Recent work focuses on developing differentially private versions of individual statistical and machine learning tasks.
We present two differentially private algorithms for linear regression with linked data.
arXiv Detail & Related papers (2023-08-01T21:00:19Z)
- Going beyond research datasets: Novel intent discovery in the industry setting [60.90117614762879]
This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform.
We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision.
We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
arXiv Detail & Related papers (2023-05-09T14:21:29Z)
- Federated Offline Reinforcement Learning [55.326673977320574]
We propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites.
We design the first federated policy optimization algorithm for offline RL with sample complexity.
We give a theoretical guarantee for the proposed algorithm, where the suboptimality for the learned policies is comparable to the rate as if data is not distributed.
arXiv Detail & Related papers (2022-06-11T18:03:26Z)
- Privacy preserving n-party scalar product protocol [0.0]
Privacy-preserving machine learning enables the training of models on decentralized datasets without the need to reveal the data.
The privacy preserving scalar product protocol, which enables the dot product of vectors without revealing them, is one popular example for its versatility.
We propose a generalization of the protocol for an arbitrary number of parties, based on an existing two-party method.
arXiv Detail & Related papers (2021-12-17T11:14:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.