Comparing the Utility and Disclosure Risk of Synthetic Data with Samples
of Microdata
- URL: http://arxiv.org/abs/2207.03339v1
- Date: Sat, 2 Jul 2022 20:38:29 GMT
- Authors: Claire Little, Mark Elliot, Richard Allmendinger
- Abstract summary: There is no consensus on how to measure the associated utility and disclosure risk of the data.
The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible.
The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most statistical agencies release randomly selected samples of Census
microdata, usually with sample fractions under 10% and with other forms of
statistical disclosure control (SDC) applied. An alternative to SDC is data
synthesis, which has been attracting growing interest, yet there is no clear
consensus on how to measure the associated utility and disclosure risk of the
data. The ability to produce synthetic Census microdata, where the utility and
associated risks are clearly understood, could mean that more timely and
wider-ranging access to microdata would be possible.
This paper follows on from previous work by the authors which mapped
synthetic Census data on a risk-utility (R-U) map. The paper presents a
framework to measure the utility and disclosure risk of synthetic data by
comparing it to samples of the original data of varying sample fractions,
thereby identifying the sample fraction which has equivalent utility and risk
to the synthetic data. Three commonly used data synthesis packages are
compared, with some interesting results. Further work is needed in several
directions, but the methodology looks very promising.
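The framework in the abstract can be illustrated in a few lines: place a crude synthetic dataset and samples of varying fractions on the same risk-utility (R-U) map, then look for the sample fraction with equivalent utility and risk. The utility proxy (marginal total variation distance) and risk proxy (unique-match rate), as well as the column-shuffled "synthetic" stand-in, are illustrative assumptions here, not the measures or synthesizers used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "census" microdata: 5 categorical attributes, 6 categories each.
original = rng.integers(0, 6, size=(5_000, 5))

def utility(data, reference):
    """Illustrative utility proxy: mean over attributes of
    1 - total variation distance between marginal frequencies."""
    scores = []
    for j in range(reference.shape[1]):
        f1 = np.bincount(data[:, j], minlength=6) / len(data)
        f2 = np.bincount(reference[:, j], minlength=6) / len(reference)
        scores.append(1.0 - 0.5 * np.abs(f1 - f2).sum())
    return float(np.mean(scores))

def risk(data, reference):
    """Illustrative risk proxy: fraction of records in `data` that
    exactly match a record that is *unique* in `reference`."""
    ref_rows, counts = np.unique(reference, axis=0, return_counts=True)
    uniques = {tuple(r) for r, c in zip(ref_rows, counts) if c == 1}
    return float(np.mean([tuple(r) in uniques for r in data]))

# A crude "synthetic" stand-in: shuffle each column independently,
# preserving marginals but destroying the joint structure.
synthetic = np.column_stack(
    [rng.permutation(original[:, j]) for j in range(original.shape[1])])

# Each (risk, utility) pair is one point on the R-U map.
print("synthetic", round(utility(synthetic, original), 3),
      round(risk(synthetic, original), 3))
for f in (0.01, 0.05, 0.10):
    n = int(f * len(original))
    sample = original[rng.choice(len(original), n, replace=False)]
    print(f"fraction={f:.2f}", round(utility(sample, original), 3),
          round(risk(sample, original), 3))
```

The sample fraction whose point lies closest to the synthetic data's point is the "equivalent" fraction in the sense of the paper's framework.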
Related papers
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
- Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets [0.0]
We study the applicability of procedures based on combining rules to the analysis of DIPS datasets.
Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases.
arXiv Detail & Related papers (2024-05-08T02:33:35Z)
- Multi-objective evolutionary GAN for tabular data synthesis [0.873811641236639]
Synthetic data has a key role to play in data sharing by statistical agencies and other generators of statistical data products.
This paper proposes a smart MO evolutionary conditional GAN (SMOE-CTGAN) for synthetic data.
Our results indicate that SMOE-CTGAN is able to discover synthetic datasets with different risk and utility levels for multiple national census datasets.
arXiv Detail & Related papers (2024-04-15T23:07:57Z)
- Differentially Private Verification of Survey-Weighted Estimates [0.5985204759362747]
Several official statistics agencies release synthetic data as public use microdata files.
One approach is to couple synthetic data with a verification server that provides users with measures of the similarity of estimates computed with the synthetic and underlying confidential data.
We present a verification measure that satisfies differential privacy and can be used when the underlying confidential data are collected with a complex survey design.
arXiv Detail & Related papers (2024-04-03T07:12:18Z)
- Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown [50.40020716418472]
This study conducts a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity.
Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated.
arXiv Detail & Related papers (2024-01-31T17:38:34Z)
- DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms [17.562365686511818]
We present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other.
We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
arXiv Detail & Related papers (2023-09-15T17:38:59Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- Synthcity: facilitating innovative use cases of synthetic data in different data modalities [86.52703093858631]
Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation.
Synthcity provides practitioners with a single access point to cutting-edge research and tools in synthetic data.
arXiv Detail & Related papers (2023-01-18T14:49:54Z)
- Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study [0.0]
We use a combination of geometry, topology, and robust statistics for hypothesis testing to evaluate the "validity" of generated data.
We additionally contrast the findings with prominent global metric practices described in the literature for large sample size data.
arXiv Detail & Related papers (2022-11-19T18:18:52Z)
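Several of the entries above build on standard primitives. As one illustration, the maximum mean discrepancy (MMD) behind the "Data Distribution Valuation" entry can be sketched in a few lines; the Gaussian RBF kernel and the biased V-statistic estimator used here are assumptions, since the summary does not specify that paper's exact choices:

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between
    samples x and y under a Gaussian RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
shifted = mmd2(rng.normal(size=(200, 2)),
               rng.normal(2.0, 1.0, size=(200, 2)))
# A sample from the same distribution scores a much smaller MMD than a
# shifted one, which is what makes MMD usable as a distribution-value signal.
print(same < shifted)
```

A dataset whose MMD to the reference distribution is small would be assigned a high value under such a scheme.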
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.