SYNC: A Copula based Framework for Generating Synthetic Data from
Aggregated Sources
- URL: http://arxiv.org/abs/2009.09471v1
- Date: Sun, 20 Sep 2020 16:36:25 GMT
- Title: SYNC: A Copula based Framework for Generating Synthetic Data from
Aggregated Sources
- Authors: Zheng Li, Yue Zhao, Jialin Fu
- Abstract summary: We study a synthetic data generation task called downscaling.
We propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula).
We make four key contributions in this work.
- Score: 8.350531869939351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A synthetic dataset is a data object that is generated programmatically, and
it may be valuable for creating a single dataset from multiple sources when
direct collection is difficult or costly. Although it is a fundamental step for
many data science tasks, an efficient and standard framework is absent. In this
paper, we study a specific synthetic data generation task called downscaling, a
procedure to infer high-resolution, harder-to-collect information (e.g.,
individual level records) from many low-resolution, easy-to-collect sources,
and propose a multi-stage framework called SYNC (Synthetic Data Generation via
Gaussian Copula). For given low-resolution datasets, the central idea of SYNC
is to fit Gaussian copula models to each of the low-resolution datasets in
order to correctly capture dependencies and marginal distributions, and then
sample from the fitted models to obtain the desired high-resolution subsets.
Predictive models are then used to merge sampled subsets into one, and finally,
sampled datasets are scaled according to low-resolution marginal constraints.
We make four key contributions in this work: 1) propose a novel framework for
generating individual level data from aggregated data sources by combining
state-of-the-art machine learning and statistical techniques, 2) perform
simulation studies to validate SYNC's performance as a synthetic data
generation algorithm, 3) demonstrate, through two real-world datasets, its value
as a feature engineering tool and as an alternative to data collection in
situations where direct gathering is difficult, 4) release an easy-to-use framework
implementation for reproducibility and scalability at the production level that
easily incorporates new data.
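As a concrete illustration of the central idea (fit a Gaussian copula to a low-resolution source so that marginals and dependencies are captured, then sample synthetic records from it), here is a minimal sketch using empirical marginals. The table, column meanings, and function names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the fit-and-sample step behind SYNC (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical "low-resolution" numeric table (rows = aggregated units).
low_res = np.column_stack([
    rng.normal(50, 10, 500),       # e.g., mean age
    rng.lognormal(10, 0.5, 500),   # e.g., median income
    rng.beta(2, 5, 500),           # e.g., unemployment rate
])

def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from empirical marginals."""
    n, _ = data.shape
    # Probability-integral transform each column via its empirical CDF,
    # then map to standard normal scores.
    u = stats.rankdata(data, axis=0) / (n + 1)
    z = stats.norm.ppf(u)
    return np.corrcoef(z, rowvar=False)

def sample_gaussian_copula(data, corr, n_samples):
    """Draw synthetic rows: sample correlated normals, then push them
    back through each column's empirical quantile function."""
    d = data.shape[1]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = stats.norm.cdf(z)
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(d)]
    )

corr = fit_gaussian_copula(low_res)
synthetic = sample_gaussian_copula(low_res, corr, n_samples=2000)
print("copula correlation:\n", np.round(corr, 2))
print("synthetic sample shape:", synthetic.shape)
```

In the full pipeline described above, this fit-and-sample step would be repeated for each low-resolution source; the sampled subsets are then merged with predictive models and rescaled so that their aggregates match the known low-resolution marginal constraints.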
Related papers
- Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic.
Our approach transforms numerical data into text, re-framing data generation as a language modeling task.
Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- CTSyn: A Foundational Model for Cross Tabular Data Generation [9.568990880984813]
Cross-Table Synthesizer (CTSyn) is a diffusion-based foundational model tailored for tabular data generation.
CTSyn significantly outperforms existing table synthesizers in utility and diversity.
It also uniquely enhances the performance of downstream machine learning beyond what is achievable with real data.
arXiv Detail & Related papers (2024-06-07T04:04:21Z)
- Trading Off Scalability, Privacy, and Performance in Data Synthesis [11.698554876505446]
We introduce (a) the Howso engine and (b) our proposed random-projection-based synthetic data generation framework.
We show that the synthetic data generated by the Howso engine have good privacy and accuracy, which results in the best overall score.
Our proposed random-projection-based framework generates synthetic data with the highest accuracy score and scales the fastest.
arXiv Detail & Related papers (2023-12-09T02:04:25Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks the distribution gap between synthetic and real data.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Generation and Simulation of Synthetic Datasets with Copulas [0.0]
We present a complete and reliable algorithm for generating a synthetic data set comprising numeric or categorical variables.
Applying our methodology to two datasets shows better performance compared to other methods such as SMOTE and autoencoders.
arXiv Detail & Related papers (2022-03-30T13:22:44Z)
- Data Augmentation for Abstractive Query-Focused Multi-Document Summarization [129.96147867496205]
We present two QMDS training datasets, which we construct using two data augmentation methods.
These two datasets have complementary properties, i.e., QMDSCNN has real summaries but queries are simulated, while QMDSIR has real queries but simulated summaries.
We build end-to-end neural network models on the combined datasets that yield new state-of-the-art transfer results on DUC datasets.
arXiv Detail & Related papers (2021-03-02T16:57:01Z)