Related papers: SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources

SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources

URL: http://arxiv.org/abs/2009.09471v1
Date: Sun, 20 Sep 2020 16:36:25 GMT
Title: SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources
Authors: Zheng Li, Yue Zhao, Jialin Fu
Abstract summary: We study synthetic data generation task called downscaling. We propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula) We make four key contributions in this work.
Score: 8.350531869939351
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A synthetic dataset is a data object that is generated programmatically, and it may be valuable to creating a single dataset from multiple sources when direct collection is difficult or costly. Although it is a fundamental step for many data science tasks, an efficient and standard framework is absent. In this paper, we study a specific synthetic data generation task called downscaling, a procedure to infer high-resolution, harder-to-collect information (e.g., individual level records) from many low-resolution, easy-to-collect sources, and propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula). For given low-resolution datasets, the central idea of SYNC is to fit Gaussian copula models to each of the low-resolution datasets in order to correctly capture dependencies and marginal distributions, and then sample from the fitted models to obtain the desired high-resolution subsets. Predictive models are then used to merge sampled subsets into one, and finally, sampled datasets are scaled according to low-resolution marginal constraints. We make four key contributions in this work: 1) propose a novel framework for generating individual level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) perform simulation studies to validate SYNC's performance as a synthetic data generation algorithm, 3) demonstrate its value as a feature engineering tool, as well as an alternative to data collection in situations where gathering is difficult through two real-world datasets, 4) release an easy-to-use framework implementation for reproducibility and scalability at the production level that easily incorporates new data.

Related papers

Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
SampleLLM: Optimizing Tabular Data Synthesis in Recommendations [46.689486044254544]
Tabular data synthesis is crucial in machine learning, yet existing general methods are highly data-dependent and often fall short in recommender systems. This limitation arises from their difficulty in capturing complex distributions and understanding feature relationships from sparse and limited data. We propose a novel two-stage framework named SampleLLM to improve the quality of LLM-based data synthesis for recommendation tasks.
arXiv Detail & Related papers (2025-01-27T15:12:27Z)
Hierarchical Conditional Tabular GAN for Multi-Tabular Synthetic Data Generation [0.0]
We propose the algorithm HCTGAN to synthesize multi-tabular data from complex multi-tabular datasets. Our findings show that our proposed algorithm can more efficiently sample large amounts of synthetic data for deep and complex multi-tabular datasets. We conclude that the HCTGAN algorithm is suitable for generating large amounts of synthetic data efficiently for deep multi-tabular datasets with complex relationships.
arXiv Detail & Related papers (2024-11-11T14:09:26Z)
Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic. Our approach transforms numerical data into text, re-framing data generation as a language modeling task. Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
CTSyn: A Foundational Model for Cross Tabular Data Generation [9.568990880984813]
Cross-Table Synthesizer (CTSyn) is a diffusion-based foundational model tailored for tabular data generation. CTSyn significantly outperforms existing table synthesizers in utility and diversity. It also uniquely enhances performances of downstream machine learning beyond what is achievable with real data.
arXiv Detail & Related papers (2024-06-07T04:04:21Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data. We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap. Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
Generation and Simulation of Synthetic Datasets with Copulas [0.0]
We present a complete and reliable algorithm for generating a synthetic data set comprising numeric or categorical variables. Applying our methodology to two datasets shows better performance compared to other methods such as SMOTE and autoencoders.
arXiv Detail & Related papers (2022-03-30T13:22:44Z)
Data Augmentation for Abstractive Query-Focused Multi-Document Summarization [129.96147867496205]
We present two QMDS training datasets, which we construct using two data augmentation methods. These two datasets have complementary properties, i.e., QMDSCNN has real summaries but queries are simulated, while QMDSIR has real queries but simulated summaries. We build end-to-end neural network models on the combined datasets that yield new state-of-the-art transfer results on DUC datasets.
arXiv Detail & Related papers (2021-03-02T16:57:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.