Noise-Aware Statistical Inference with Differentially Private Synthetic
Data
- URL: http://arxiv.org/abs/2205.14485v1
- Date: Sat, 28 May 2022 16:59:46 GMT
- Title: Noise-Aware Statistical Inference with Differentially Private Synthetic
Data
- Authors: Ossi Räisä (1), Joonas Jälkö (2), Samuel Kaski (2 and 3),
  Antti Honkela (1) ((1) University of Helsinki, (2) Aalto University, (3)
  University of Manchester)
- Abstract summary: We show that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities.
We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation with noise-aware synthetic data generation.
We develop a novel noise-aware synthetic data generation algorithm NAPSU-MQ using the principle of maximum entropy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While generation of synthetic data under differential privacy (DP) has
received a lot of attention in the data privacy community, analysis of
synthetic data has received much less. Existing work has shown that simply
analysing DP synthetic data as if it were real does not produce valid
inferences of population-level quantities. For example, confidence intervals
become too narrow, which we demonstrate with a simple experiment. We tackle
this problem by combining synthetic data analysis techniques from the field of
multiple imputation with synthetic data generation based on noise-aware
Bayesian modeling, yielding a pipeline, NA+MI, that allows computing accurate uncertainty
estimates for population-level quantities from DP synthetic data. To implement
NA+MI for discrete data generation from marginal queries, we develop a novel
noise-aware synthetic data generation algorithm NAPSU-MQ using the principle of
maximum entropy. Our experiments demonstrate that the pipeline is able to
produce accurate confidence intervals from DP synthetic data. The intervals
become wider with tighter privacy to accurately capture the additional
uncertainty stemming from DP noise.
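Since NA+MI builds on multiple imputation, its combining step can be illustrated with Rubin's rules, the standard way in the MI literature to pool estimates from m imputed (here: synthetic) datasets into one point estimate and confidence interval. The sketch below is a generic illustration under that assumption, not the paper's exact combining rules, and the function name and example numbers are hypothetical.

```python
# Minimal sketch: Rubin's combining rules from the multiple-imputation
# literature, which NA+MI-style pipelines build on. Illustrative only;
# the paper's exact combining rules for DP synthetic data may differ.
import numpy as np
from scipy import stats

def combine_mi_estimates(estimates, variances, alpha=0.05):
    """Pool per-dataset estimates and variances from m synthetic datasets."""
    q = np.asarray(estimates, dtype=float)  # point estimate per dataset
    u = np.asarray(variances, dtype=float)  # within-dataset variance per dataset
    m = len(q)
    q_bar = q.mean()                        # pooled point estimate
    u_bar = u.mean()                        # average within-dataset variance
    b = q.var(ddof=1)                       # between-dataset variance
    total_var = u_bar + (1.0 + 1.0 / m) * b
    if b > 0:                               # Rubin's (1987) degrees of freedom
        df = (m - 1) * (1.0 + u_bar / ((1.0 + 1.0 / m) * b)) ** 2
        crit = stats.t.ppf(1.0 - alpha / 2.0, df)
    else:                                   # no between-dataset spread
        crit = stats.norm.ppf(1.0 - alpha / 2.0)
    half = crit * np.sqrt(total_var)
    return q_bar, (q_bar - half, q_bar + half)

# Hypothetical usage: a regression coefficient estimated on m = 5 synthetic
# datasets, with a standard error from each fit.
est = [0.42, 0.51, 0.38, 0.47, 0.44]
se = [0.10, 0.11, 0.09, 0.10, 0.12]
point, ci = combine_mi_estimates(est, [s ** 2 for s in se])
print(point, ci)
```

Intuitively, the widening of intervals under tighter privacy that the abstract describes enters through the between-dataset variance b: when the noise-aware generator makes the synthetic datasets disagree more, b grows and the pooled interval widens.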
Related papers
- Data-adaptive Differentially Private Prompt Synthesis for In-Context Learning [16.04405606517753]
Large Language Models (LLMs) rely on the contextual information embedded in examples/demonstrations to perform in-context learning (ICL).
We introduce a novel data-adaptive differentially private algorithm called AdaDPSyn to generate synthetic examples from a private dataset.
AdaDPSyn adaptively adjusts the noise level in the data synthesis mechanism according to the inherent statistical properties of the data.
arXiv Detail & Related papers (2024-10-15T22:06:30Z)
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z)
- Does Differentially Private Synthetic Data Lead to Synthetic Discoveries? [1.9573380763700712]
The evaluation is conducted in terms of the tests' Type I and Type II errors.
A large portion of the evaluation results showed dramatically inflated Type I errors, especially at privacy budget levels of $\epsilon \leq 1$.
arXiv Detail & Related papers (2024-03-20T14:03:57Z)
- Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown [50.40020716418472]
This study conducts a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity.
Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated.
arXiv Detail & Related papers (2024-01-31T17:38:34Z)
- DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms [17.562365686511818]
We present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other.
We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
arXiv Detail & Related papers (2023-09-15T17:38:59Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms [30.330715718619874]
Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST).
Despite their promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature.
arXiv Detail & Related papers (2023-01-21T01:32:58Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data alone by approximating the unobserved feature distribution with a maximum entropy hypothesis; a minimal moment-matching sketch of this principle appears after this list.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Learning Summary Statistics for Bayesian Inference with Autoencoders [58.720142291102135]
We use the inner dimension of deep neural network based Autoencoders as summary statistics.
To create an incentive for the encoder to encode all the parameter-related information but not the noise, we give the decoder access to explicit or implicit information that has been used to generate the training data.
arXiv Detail & Related papers (2022-01-28T12:00:31Z)
- Differentially Private Federated Learning with Laplacian Smoothing [72.85272874099644]
Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users.
An adversary may still be able to infer the private training data by attacking the released model.
Differential privacy provides a statistical protection against such attacks at the price of significantly degrading the accuracy or utility of the trained models.
arXiv Detail & Related papers (2020-05-01T04:28:38Z)
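The maximum entropy principle invoked both by NAPSU-MQ above and by the "Learning from aggregated data" entry in this list has a compact computational core: subject to matching expected query values c, the maximum-entropy distribution is the exponential family p_theta(x) proportional to exp(theta . q(x)), and fitting theta reduces to a convex moment-matching problem. The self-contained sketch below illustrates this over small binary vectors; the domain, query set, targets, and names are illustrative, not taken from either paper's code.

```python
# Minimal sketch of maximum-entropy fitting from marginal queries: find the
# exponential-family distribution p_theta(x) ~ exp(theta . q(x)) whose query
# expectations match given targets c. All targets here are illustrative;
# real (e.g. DP-noised) targets would come from the data holder.
import itertools
import numpy as np

d = 4                                        # number of binary variables
states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)

def queries(x):
    """Sufficient statistics: first-order and pairwise AND marginals."""
    pairs = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([x, pairs])

Q = np.array([queries(x) for x in states])   # shape (2**d, n_queries)
# Feasible targets: P(x_i = 1) = 0.5, P(x_i = 1 and x_j = 1) = 0.3.
c = np.concatenate([np.full(d, 0.5), np.full(d * (d - 1) // 2, 0.3)])

theta = np.zeros(Q.shape[1])
for _ in range(5000):                        # gradient descent on log Z(theta) - theta . c
    logits = Q @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()                             # current model distribution
    theta -= 0.5 * (Q.T @ p - c)             # gradient is E_p[q(x)] - c

# Synthetic data can then be sampled from the fitted distribution.
rng = np.random.default_rng(0)
synth = states[rng.choice(len(states), size=1000, p=p)]
print("fitted query expectations:", np.round(Q.T @ p, 3))
```

Per the abstract, NAPSU-MQ goes beyond this point estimate by being noise-aware: it uses Bayesian modeling to infer a posterior over the parameters given the DP-noised query answers, so that the generated datasets also reflect the uncertainty added by the DP noise.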