Private Synthetic Data with Hierarchical Structure
- URL: http://arxiv.org/abs/2206.05942v1
- Date: Mon, 13 Jun 2022 07:22:21 GMT
- Title: Private Synthetic Data with Hierarchical Structure
- Authors: Terrance Liu, Zhiwei Steven Wu
- Abstract summary: We study the problem of differentially private synthetic data generation for hierarchical datasets in which individual data points are grouped together.
In particular, to measure the similarity between the synthetic dataset and the underlying private one, we frame our objective under the problem of private query release.
We introduce private synthetic data algorithms for hierarchical query release and evaluate them on hierarchical datasets.
- Score: 33.72123440111452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of differentially private synthetic data generation for
hierarchical datasets in which individual data points are grouped together
(e.g., people within households). In particular, to measure the similarity
between the synthetic dataset and the underlying private one, we frame our
objective under the problem of private query release, generating a synthetic
dataset that preserves answers for some collection of queries (i.e., statistics
like mean aggregate counts). However, while the application of private
synthetic data to the problem of query release has been well studied, such
research is restricted to non-hierarchical data domains, raising the initial
question -- what queries are important when considering data of this form?
Moreover, it has not yet been established how one can generate synthetic data
at both the group and individual levels while capturing such statistics. In
light of these challenges, we first formalize the problem of hierarchical query
release, in which the goal is to release a collection of statistics for some
hierarchical dataset. Specifically, we provide a general set of statistical
queries that captures relationships between attributes at both the group and
individual levels. Subsequently, we introduce private synthetic data algorithms
for hierarchical query release and evaluate them on hierarchical datasets
derived from the American Community Survey and Allegheny Family Screening Tool
data. Finally, we look to the American Community Survey, whose inherent
hierarchical structure gives rise to another set of domain-specific queries
that we run experiments with.
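To make the two query levels in the abstract concrete, here is a minimal sketch in Python. It is not the paper's algorithm. The `Person` and `Household` types, the attribute names, the helper functions, and the choice of the Laplace mechanism with the household as the privacy unit are all assumptions made for illustration only.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Person:
    age: int            # individual-level attribute (hypothetical)


@dataclass
class Household:
    tenure: str         # group-level attribute (hypothetical)
    members: List[Person]


def group_count(households: List[Household],
                group_pred: Callable[[Household], bool],
                person_pred: Callable[[Person], bool]) -> int:
    """Count households matching group_pred with at least one member matching person_pred."""
    return sum(1 for h in households
               if group_pred(h) and any(person_pred(p) for p in h.members))


def individual_count(households: List[Household],
                     group_pred: Callable[[Household], bool],
                     person_pred: Callable[[Person], bool]) -> int:
    """Count individuals matching person_pred who live in a household matching group_pred."""
    return sum(sum(1 for p in h.members if person_pred(p))
               for h in households if group_pred(h))


def noisy_count(true_count: float, sensitivity: float,
                epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism: add Laplace(sensitivity / epsilon) noise to a count."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


households = [
    Household("rent", [Person(34), Person(5)]),
    Household("own", [Person(70)]),
    Household("rent", [Person(22), Person(23), Person(51)]),
]


def is_renter(h: Household) -> bool:
    return h.tenure == "rent"


def is_under_30(p: Person) -> bool:
    return p.age < 30


g = group_count(households, is_renter, is_under_30)       # 2 renting households
i = individual_count(households, is_renter, is_under_30)  # 3 individuals

# Assumption: neighboring datasets differ by one household, and households
# hold at most MAX_SIZE people. Removing a household then changes the group
# count by at most 1 and the individual count by at most MAX_SIZE.
MAX_SIZE = 3
rng = random.Random(0)
print(g, noisy_count(g, sensitivity=1, epsilon=1.0, rng=rng))
print(i, noisy_count(i, sensitivity=MAX_SIZE, epsilon=1.0, rng=rng))
```

The same pair of predicates gives different answers at the two levels (2 households versus 3 individuals), which is why a hierarchical query set needs both kinds of statistics. A synthetic dataset for query release would then be fit so that its answers to such queries stay close to the noisy ones.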
Related papers
- Tabular Data Synthesis with Differential Privacy: A Survey [24.500349285858597]
Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights.
Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data.
Differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing.
arXiv Detail & Related papers (2024-11-04T06:32:48Z)
- Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets [0.0]
We study the applicability of procedures based on combining rules to the analysis of DIPS datasets.
Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases.
arXiv Detail & Related papers (2024-05-08T02:33:35Z)
- Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown [50.40020716418472]
This study conducts a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity.
Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated.
arXiv Detail & Related papers (2024-01-31T17:38:34Z)
- DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms [17.562365686511818]
We present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other.
We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
arXiv Detail & Related papers (2023-09-15T17:38:59Z)
- Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections [19.148874215745135]
We study the problem of continually releasing differentially private synthetic data from longitudinal data collections.
We introduce a model where, in every time step, each individual reports a new data element.
We give continual synthetic data generation algorithms that preserve two basic types of queries.
arXiv Detail & Related papers (2023-06-13T16:22:08Z)
- GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources [21.32471030724983]
Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems.
In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data.
arXiv Detail & Related papers (2022-12-08T01:22:12Z)
- Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
- Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels wrongly direct the neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Differential Privacy of Hierarchical Census Data: An Optimization Approach [53.29035917495491]
Census Bureaus are interested in releasing aggregate socio-economic data about a large population without revealing sensitive information about any individual.
Recent events have identified some of the privacy challenges faced by these organizations.
This paper presents a novel differential-privacy mechanism for releasing hierarchical counts of individuals.
arXiv Detail & Related papers (2020-06-28T18:19:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.