Are Data Experts Buying into Differentially Private Synthetic Data? Gathering Community Perspectives
- URL: http://arxiv.org/abs/2412.13030v1
- Date: Tue, 17 Dec 2024 15:50:14 GMT
- Title: Are Data Experts Buying into Differentially Private Synthetic Data? Gathering Community Perspectives
- Authors: Lucas Rosenblatt, Bill Howe, Julia Stoyanovich
- Abstract summary: In the United States, differential privacy (DP) is the dominant technical operationalization of privacy-preserving data analysis. This study qualitatively examines one class of DP mechanisms: private data synthesizers.
- Score: 14.736115103446101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data privacy is a core tenet of responsible computing, and in the United States, differential privacy (DP) is the dominant technical operationalization of privacy-preserving data analysis. With this study, we qualitatively examine one class of DP mechanisms: private data synthesizers. To that end, we conducted semi-structured interviews with data experts: academics and practitioners who regularly work with data. Broadly, our findings suggest that quantitative DP benchmarks must be grounded in practitioner needs, while communication challenges persist. Participants expressed a need for context-aware DP solutions, focusing on parity between research outcomes on real and synthetic data. Our analysis led to three recommendations: (1) improve existing, insufficient sanitized benchmarks, since successful DP implementations require well-documented, partner-vetted use cases; (2) organizations using DP synthetic data should publish discipline-specific standards of evidence; and (3) tiered data access models could allow researchers to gradually access sensitive data based on demonstrated competence with high-privacy, low-fidelity synthetic data.
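As background for the abstract above (an explanatory note, not part of the paper's text), every mechanism discussed in this list targets the standard guarantee of $\varepsilon$-differential privacy: a randomized mechanism $\mathcal{M}$ satisfies $\varepsilon$-DP if, for all datasets $D, D'$ differing in a single record and all output sets $S$,

$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S].$$

Smaller $\varepsilon$ means stronger privacy. A private data synthesizer is a mechanism $\mathcal{M}$ whose output is an entire synthetic dataset; by the post-processing property of DP, any analysis run on that dataset inherits the same guarantee.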
Related papers
- A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand that synthetic data satisfy different requirements to be useful in practice.
Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities.
We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z) - SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy [0.0]
We investigate the capability of Large Language Models (LLMs) to generate synthetic datasets with Differential Privacy (DP) mechanisms.
Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process.
We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data.
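For illustration, here is a minimal sketch of the Laplace and Gaussian noise-injection step described above (not the SafeSynthDP implementation; the sensitivity and privacy parameters are assumed values for the example):

```python
import numpy as np

def laplace_mechanism(values, sensitivity, epsilon):
    """Epsilon-DP release of a numeric query via Laplace(0, sensitivity/epsilon) noise."""
    return values + np.random.laplace(0.0, sensitivity / epsilon, size=values.shape)

def gaussian_mechanism(values, sensitivity, epsilon, delta):
    """(Epsilon, delta)-DP release via Gaussian noise (classical bound, valid for epsilon < 1)."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return values + np.random.normal(0.0, sigma, size=values.shape)

# Example: privatize per-category counts before handing them to a generator.
counts = np.array([120.0, 45.0, 8.0])
private_counts = laplace_mechanism(counts, sensitivity=1.0, epsilon=0.5)
```

Comparing models trained on the noised output against models trained on the original data then yields the utility measurement the summary describes.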
arXiv Detail & Related papers (2024-12-30T01:10:10Z) - Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918]
The 2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition.
arXiv Detail & Related papers (2024-12-02T11:12:01Z) - Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains [9.123834467375532]
We explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in high-stakes domains.
Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data.
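One concrete form such an evaluation takes is a train-synthetic/test-real parity check, echoing the "parity between research outcomes" theme of the main abstract. Below is a minimal scikit-learn sketch with placeholder data (the noised copy standing in for DP synthetic data is an assumption for the example, not any paper's method):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder "real" dataset; in practice this is the sensitive data.
X_real, y_real = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, test_size=0.3, random_state=0)

# Crude stand-in for a DP synthetic copy of the training split.
rng = np.random.default_rng(0)
X_syn, y_syn = X_tr + rng.laplace(scale=0.5, size=X_tr.shape), y_tr

real_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
syn_model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

# Utility gap on held-out real data: a large gap signals low-utility synthetic data.
gap = (accuracy_score(y_te, real_model.predict(X_te))
       - accuracy_score(y_te, syn_model.predict(X_te)))
print(f"train-synthetic/test-real utility gap: {gap:.3f}")
```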
arXiv Detail & Related papers (2024-10-10T19:31:02Z) - FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution, by consolidating collaborative training across multiple data owners.
FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z) - Differentially Private Data Generation with Missing Data [25.242190235853595]
We formalize the problems of differential privacy (DP) synthetic data with missing values.
We propose three effective adaptive strategies that significantly improve the utility of the synthetic data.
Overall, this study contributes to a better understanding of the challenges and opportunities for using private synthetic data generation algorithms.
arXiv Detail & Related papers (2023-10-17T19:41:54Z) - A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z) - Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks [70.39633252935445]
Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora.
For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination.
We propose three strategies that can make a difference: (1) Test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; and (3) avoid data which appears with its solution on the internet, and release the web-page context of internet-derived data along with the data.
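As a rough sketch of strategy (1), the snippet below encrypts a test file before upload. It uses symmetric Fernet encryption from the `cryptography` package as a simplification of the public-key scheme the authors propose, and the file name is hypothetical:

```python
from cryptography.fernet import Fernet

# One-time key, shared only with parties allowed to evaluate
# (the paper proposes public-key encryption; Fernet is a symmetric simplification).
key = Fernet.generate_key()
cipher = Fernet(key)

with open("test_set.jsonl", "rb") as f:      # hypothetical test-set file
    ciphertext = cipher.encrypt(f.read())

with open("test_set.jsonl.enc", "wb") as f:  # upload only this encrypted artifact
    f.write(ciphertext)

# Holders of the key recover the plaintext locally for evaluation.
plaintext = cipher.decrypt(ciphertext)
```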
arXiv Detail & Related papers (2023-05-17T12:23:38Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - Don't Look at the Data! How Differential Privacy Reconfigures the Practices of Data Science [0.0]
Differential privacy (DP) is one promising way to offer privacy along with open access.
We conduct interviews with 19 data practitioners who are non-experts in DP.
We find that while DP is promising for providing wider access to sensitive datasets, it also introduces challenges into every stage of the data science workflow.
arXiv Detail & Related papers (2023-02-23T04:28:14Z) - DP2-Pub: Differentially Private High-Dimensional Data Publication with Invariant Post Randomization [58.155151571362914]
We propose a differentially private high-dimensional data publication mechanism (DP2-Pub) that runs in two phases.
Splitting attributes into several low-dimensional clusters with high intra-cluster cohesion and low inter-cluster coupling helps obtain a reasonable privacy budget.
We also extend our DP2-Pub mechanism to the scenario with a semi-honest server which satisfies local differential privacy.
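To make the budget-splitting idea concrete, here is a heavily simplified sketch (not the DP2-Pub algorithm itself): attributes are assumed already grouped into clusters, and the total budget is divided across per-cluster Laplace-noised marginals under sequential composition. The data and cluster assignments are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age_band": rng.choice(["<30", "30-60", ">60"], size=500),
    "income_band": rng.choice(["low", "mid", "high"], size=500),
    "zip3": rng.choice(["100", "101", "102"], size=500),
})

# Hypothetical output of the clustering phase: correlated attributes grouped together.
clusters = [["age_band", "income_band"], ["zip3"]]
epsilon_total = 1.0
eps_each = epsilon_total / len(clusters)  # sequential composition over the cluster marginals

def noisy_marginal(data, cols, epsilon):
    """Laplace-noised contingency table for one attribute cluster (count queries, sensitivity 1)."""
    counts = data.groupby(cols).size().astype(float)
    return counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))

# Low-dimensional clusters keep each table small, so per-cell noise stays tolerable.
marginals = {tuple(cols): noisy_marginal(df, cols, eps_each) for cols in clusters}
```

Sampling synthetic records from such noisy marginals is then post-processing and consumes no additional privacy budget.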
arXiv Detail & Related papers (2022-08-24T17:52:43Z) - Fidelity and Privacy of Synthetic Medical Data [0.0]
The digitization of medical records ushered in a new era of big data to clinical science.
The need to share individual-level medical data continues to grow, and has never been more urgent.
However, enthusiasm for the use of big data has been tempered by a fully appropriate concern for patient autonomy and privacy.
arXiv Detail & Related papers (2021-01-18T23:01:27Z)