GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using
Macro Data Sources
- URL: http://arxiv.org/abs/2212.05975v1
- Date: Thu, 8 Dec 2022 01:22:12 GMT
- Title: GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using
Macro Data Sources
- Authors: Angeela Acharya, Siddhartha Sikdar, Sanmay Das, and Huzefa Rangwala
- Abstract summary: Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems.
In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data.
- Score: 21.32471030724983
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Individual-level data (microdata) that characterizes a population is
essential for studying many real-world problems. However, acquiring such data
is not straightforward due to cost and privacy constraints, and access is often
limited to aggregated data (macro data) sources. In this study, we examine
synthetic data generation as a tool to extrapolate difficult-to-obtain
high-resolution data by combining information from multiple easier-to-obtain
lower-resolution data sources. In particular, we introduce a framework that
uses a combination of univariate and multivariate frequency tables from a given
target geographical location in combination with frequency tables from other
auxiliary locations to generate synthetic microdata for individuals in the
target location. Our method combines the estimation of a dependency graph and
conditional probabilities from the target location with the use of a Gaussian
copula to leverage the available information from the auxiliary locations. We
perform extensive testing on two real-world datasets and demonstrate that our
approach outperforms prior approaches in preserving the overall dependency
structure of the data while also satisfying the constraints defined on the
different variables.
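To make the copula step above concrete, the snippet below gives a minimal, illustrative sketch (not the authors' implementation): it assumes categorical variables with univariate frequency tables from the target location and a correlation matrix estimated from auxiliary locations, and it omits the dependency-graph and conditional-probability estimation that the paper also describes. All function and variable names are hypothetical.
```python
# Minimal sketch of a Gaussian-copula draw: sample latent Gaussians with a
# correlation matrix borrowed from auxiliary locations, then invert the target
# location's univariate frequency tables to obtain categorical records.
import numpy as np
from scipy.stats import norm

def copula_synthesize(marginals, corr, n_samples, seed=None):
    """marginals: {variable: {category: probability}} from the target location.
    corr: d x d correlation matrix estimated from auxiliary locations."""
    rng = np.random.default_rng(seed)
    names = list(marginals)
    d = len(names)

    # 1. Correlated latent Gaussians encode the cross-variable dependencies.
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)

    # 2. Push through the standard normal CDF to get dependent uniforms.
    u = norm.cdf(z)

    # 3. Invert each univariate marginal: pick the first category whose
    #    cumulative frequency exceeds the uniform draw.
    records = {}
    for j, name in enumerate(names):
        cats = list(marginals[name])
        probs = np.array([marginals[name][c] for c in cats], dtype=float)
        cum = np.cumsum(probs / probs.sum())
        idx = np.minimum(np.searchsorted(cum, u[:, j]), len(cats) - 1)
        records[name] = [cats[i] for i in idx]
    return records

# Hypothetical frequency tables and correlation, for illustration only.
marginals = {
    "age_group": {"<30": 0.40, "30-60": 0.45, ">60": 0.15},
    "employed":  {"yes": 0.60, "no": 0.40},
}
corr = np.array([[1.0, 0.3],
                 [0.3, 1.0]])
print(copula_synthesize(marginals, corr, n_samples=5, seed=0))
```
How the copula interacts with the dependency graph and the target-location conditional probabilities is specified in the paper and not reproduced in this sketch.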
Related papers
- Source-Free Collaborative Domain Adaptation via Multi-Perspective
Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
However, acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z)
- Collaborative Learning From Distributed Data With Differentially Private
Synthetic Twin Data [15.033125153840308]
We propose a framework in which each party shares a differentially private synthetic twin of their data.
We study the feasibility of combining such synthetic twin data sets for collaborative learning on real-world health data from the UK Biobank.
arXiv Detail & Related papers (2023-08-09T07:47:12Z)
- Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows users to differ in both the distribution and the quantity of their data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z)
- Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections [19.148874215745135]
We study the problem of continually releasing differentially private synthetic data from longitudinal data collections.
We introduce a model where, in every time step, each individual reports a new data element.
We give continual synthetic data generation algorithms that preserve two basic types of queries.
arXiv Detail & Related papers (2023-06-13T16:22:08Z)
- infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Local Learning Matters: Rethinking Data Heterogeneity in Federated
Learning [61.488646649045215]
Federated learning (FL) is a promising strategy for performing privacy-preserving, distributed learning with a network of clients (i.e., edge devices).
arXiv Detail & Related papers (2021-11-28T19:03:39Z)
- Multi-modal AsynDGAN: Learn From Distributed Medical Image Data without
Sharing Private Information [55.866673486753115]
We propose an extendable and elastic learning framework to preserve privacy and security.
The proposed framework is named distributed Asynchronized Discriminator Generative Adversarial Networks (AsynDGAN).
arXiv Detail & Related papers (2020-12-15T20:41:24Z)
- SYNC: A Copula based Framework for Generating Synthetic Data from
Aggregated Sources [8.350531869939351]
We study a synthetic data generation task called downscaling.
We propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula).
We make four key contributions in this work.
arXiv Detail & Related papers (2020-09-20T16:36:25Z)
- Meta-analysis of heterogeneous data: integrative sparse regression in
high-dimensions [21.162280861396205]
We consider the task of meta-analysis in high-dimensional settings in which the data sources are similar but non-identical.
We introduce a global parameter that emphasizes interpretability and statistical efficiency in the presence of heterogeneity.
We demonstrate the benefits of our approach on a large-scale drug treatment dataset involving several different cancer cell-lines.
arXiv Detail & Related papers (2019-12-26T20:30:57Z)
- Distributed Multivariate Regression Modeling For Selecting Biomarkers
Under Data Protection Constraints [0.0]
We propose a multivariable regression approach for identifying biomarkers by automatic variable selection, based on aggregated data exchanged in iterative calls.
The approach can be used to jointly analyze data distributed across several locations.
In a simulation, the information loss introduced by local standardization is seen to be minimal.
arXiv Detail & Related papers (2018-03-01T15:04:06Z)