DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms
- URL: http://arxiv.org/abs/2309.08574v1
- Date: Fri, 15 Sep 2023 17:38:59 GMT
- Title: DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms
- Authors: Shweta Patwa, Danyu Sun, Amir Gilad, Ashwin Machanavajjhala, Sudeepa Roy,
- Abstract summary: We present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other.
We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
- Score: 17.562365686511818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental operations in data analysis include analyzing aggregated statistics, e.g., count, sum, or median, on a subset of data satisfying some conditions. When synthetic data is generated, users may be interested in knowing if their aggregated queries generating such statistics can be reliably answered on the synthetic data, for instance, to decide if the synthetic data is suitable for specific tasks. However, the standard data generation systems do not provide "per-query" quality guarantees on the synthetic data, and the users have no way of knowing how much the aggregated statistics on the synthetic data can be trusted. To address this problem, we present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other while guaranteeing differential privacy. We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
Related papers
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets [0.0]
We study the applicability of procedures based on combining rules to the analysis of DIPS datasets.
Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases.
arXiv Detail & Related papers (2024-05-08T02:33:35Z) - Does Differentially Private Synthetic Data Lead to Synthetic Discoveries? [1.9573380763700712]
The evaluation is conducted in terms of the tests' Type I and Type II errors.
A large portion of the evaluation results expressed dramatically inflated Type I errors, especially at privacy budget levels of $epsilonleq 1$.
arXiv Detail & Related papers (2024-03-20T14:03:57Z) - Preserving correlations: A statistical method for generating synthetic
data [0.0]
We propose a method to generate statistically representative synthetic data.
The main goal is to be able to maintain in the synthetic dataset the correlations of the features present in the original one.
We describe in detail our algorithm used both for the analysis of the original dataset and for the generation of the synthetic data points.
arXiv Detail & Related papers (2024-03-03T10:35:46Z) - Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown [50.40020716418472]
This study conducts a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity.
Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated.
arXiv Detail & Related papers (2024-01-31T17:38:34Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity of data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic
Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find some methods to perform better than others across the board.
We do get promising findings for classification tasks when using synthetic data for training machine learning models.
arXiv Detail & Related papers (2022-11-23T11:09:52Z) - Noise-Aware Statistical Inference with Differentially Private Synthetic
Data [0.0]
We show that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities.
We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation, and synthetic data generation.
We develop a novel noise-aware synthetic data generation algorithm NAPSU-MQ using the principle of maximum entropy.
arXiv Detail & Related papers (2022-05-28T16:59:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.