Differentially Private Verification of Survey-Weighted Estimates
- URL: http://arxiv.org/abs/2404.02519v1
- Date: Wed, 3 Apr 2024 07:12:18 GMT
- Title: Differentially Private Verification of Survey-Weighted Estimates
- Authors: Tong Lin, Jerome P. Reiter
- Abstract summary: Several official statistics agencies release synthetic data as public use microdata files.
One approach is to couple synthetic data with a verification server that provides users with measures of the similarity of estimates computed with the synthetic and underlying confidential data.
We present a verification measure that satisfies differential privacy and can be used when the underlying confidential data are collected with a complex survey design.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several official statistics agencies release synthetic data as public use microdata files. In practice, synthetic data do not admit accurate results for every analysis. Thus, it is beneficial for agencies to provide users with feedback on the quality of their analyses of the synthetic data. One approach is to couple synthetic data with a verification server that provides users with measures of the similarity of estimates computed with the synthetic and underlying confidential data. However, such measures leak information about the confidential records, so that agencies may wish to apply disclosure control methods to the released verification measures. We present a verification measure that satisfies differential privacy and can be used when the underlying confidential data are collected with a complex survey design. We illustrate the verification measure using repeated sampling simulations where the confidential data are sampled with a probability proportional to size design, and the analyst estimates a population total or mean with the synthetic data. The simulations suggest that the verification measures can provide useful information about the quality of synthetic data inferences.
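The abstract does not spell out the mechanism, but the general shape of such a measure can be sketched. The following illustrative Python code (not the paper's actual algorithm) computes a survey-weighted Horvitz-Thompson total on the confidential data, forms a binary indicator of whether the synthetic-data estimate falls within a relative tolerance of it, and releases the indicator with Laplace noise; the function names, the `tolerance` parameter, and the choice of an agreement indicator are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def ht_total(values, inclusion_probs):
    # Horvitz-Thompson estimator of a population total under an
    # unequal-probability (e.g. probability-proportional-to-size) design.
    return float(np.sum(np.asarray(values) / np.asarray(inclusion_probs)))

def dp_verification_measure(conf_values, conf_probs, synthetic_estimate,
                            tolerance, epsilon):
    # Binary agreement indicator: 1 if the synthetic-data estimate falls
    # within a relative `tolerance` of the confidential weighted total.
    conf_total = ht_total(conf_values, conf_probs)
    agree = abs(synthetic_estimate - conf_total) <= tolerance * abs(conf_total)
    # Changing one confidential record can flip the 0/1 indicator, so its
    # sensitivity is 1; Laplace(1/epsilon) noise yields epsilon-DP.
    return float(agree) + rng.laplace(scale=1.0 / epsilon)
```

With a small privacy budget `epsilon`, the released value is a noisy version of the agreement indicator; the analyst interprets values near 1 as evidence the synthetic-data estimate is close to the confidential one.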
Related papers
- Fairness Issues and Mitigations in (Differentially Private) Socio-Demographic Data Processes [43.07159967207698]
This paper shows that surveys of important societal relevance introduce sampling errors that unevenly impact group-level estimates.
To address these issues, this paper introduces an optimization approach modeled on real-world survey design processes.
Privacy-preserving methods used to determine sampling rates can further impact these fairness issues.
arXiv Detail & Related papers (2024-08-16T01:13:36Z) - Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets [0.0]
We study the applicability of procedures based on combining rules to the analysis of DIPS datasets.
Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases.
arXiv Detail & Related papers (2024-05-08T02:33:35Z) - Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown [50.40020716418472]
This study conducts a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity.
Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated.
arXiv Detail & Related papers (2024-01-31T17:38:34Z) - DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms [17.562365686511818]
We present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other.
We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
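For intuition, a per-query decider for count queries can be sketched with the Laplace mechanism; this is only an illustrative stand-in, not DP-PQD's actual algorithm, and the function name and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_count_gap_decider(private_count, synthetic_count, threshold, epsilon):
    # A count query has sensitivity 1: changing one private record moves
    # the count, and hence the absolute gap, by at most 1.  Adding
    # Laplace(1/epsilon) noise to the gap before comparing against the
    # user-specified threshold gives an epsilon-DP decision.
    noisy_gap = abs(private_count - synthetic_count) + rng.laplace(scale=1.0 / epsilon)
    return bool(noisy_gap <= threshold)
```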
arXiv Detail & Related papers (2023-09-15T17:38:59Z) - Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
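A density-ratio membership score in this spirit can be approximated with kernel density estimates; the sketch below is an assumption-laden simplification (one-dimensional data, Gaussian KDEs standing in for the densities DOMIAS estimates), and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def domias_style_score(points, synthetic_samples, reference_samples):
    # Approximate the generative model's density from its synthetic output
    # and the true data distribution from an independent reference sample,
    # both via Gaussian KDE.  A large ratio flags regions where the
    # generator is locally overfitted, suggesting the point was a
    # training-set member.
    p_gen = gaussian_kde(synthetic_samples)(points)
    p_ref = gaussian_kde(reference_samples)(points)
    return p_gen / p_ref
```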
arXiv Detail & Related papers (2023-02-24T11:27:39Z) - Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find that some methods perform better than others across the board.
We also find promising results for classification tasks when using synthetic data to train machine learning models.
arXiv Detail & Related papers (2022-11-23T11:09:52Z) - Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata [0.6445605125467572]
There is no consensus on how to measure the associated utility and disclosure risk of the data.
The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible.
The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions.
arXiv Detail & Related papers (2022-07-02T20:38:29Z) - Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z) - Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data [0.0]
AI-based data synthesis has seen rapid progress over the last several years, and is increasingly recognized for its promise to enable privacy-respecting data sharing.
We introduce and demonstrate a holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions.
arXiv Detail & Related papers (2021-04-01T17:30:23Z) - Estimating informativeness of samples with Smooth Unique Information [108.25192785062367]
We measure how much a sample informs the final weights and how much it informs the function computed by the weights.
We give efficient approximations of these quantities using a linearized network.
We apply these measures to several problems, such as dataset summarization.
arXiv Detail & Related papers (2021-01-17T10:29:29Z) - Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.