Generation and Simulation of Synthetic Datasets with Copulas
- URL: http://arxiv.org/abs/2203.17250v1
- Date: Wed, 30 Mar 2022 13:22:44 GMT
- Title: Generation and Simulation of Synthetic Datasets with Copulas
- Authors: Regis Houssou, Mihai-Cezar Augustin, Efstratios Rappos, Vivien Bonvin and Stephan Robert-Nicoud
- Abstract summary: We present a complete and reliable algorithm for generating a synthetic data set comprising numeric or categorical variables.
Applying our methodology to two datasets shows better performance compared to other methods such as SMOTE and autoencoders.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a new method to generate synthetic data sets based on
copula models. Our goal is to produce surrogate data resembling real data in
terms of marginal and joint distributions. We present a complete and reliable
algorithm for generating a synthetic data set comprising numeric or categorical
variables. Applying our methodology to two datasets shows better performance
compared to other methods such as SMOTE and autoencoders.
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Stable Diffusion Dataset Generation for Downstream Classification Tasks [4.470499157873342]
This paper explores the adaptation of the Stable Diffusion 2.0 model for generating synthetic datasets.
We present a class-conditional version of the model that exploits a Class-Encoder and optimisation of key generation parameters.
Our methodology led to synthetic datasets that, in a third of cases, produced models that outperformed those trained on real datasets.
arXiv Detail & Related papers (2024-05-04T15:37:22Z)
- MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation [0.0]
We provide new algorithms for two tasks relating to heterogeneous datasets: clustering, and synthetic data generation.
We demonstrate a novel EM-based clustering algorithm, MMM, that outperforms standard algorithms in determining clusters in synthetic heterogeneous data.
We also demonstrate a synthetic data generation algorithm, MMMsynth, that pre-clusters the input data, and generates cluster-wise synthetic data.
arXiv Detail & Related papers (2023-10-30T11:26:01Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
Data synthesis is a promising way to train a small model with very little labeled data.
We propose Synthesis Step by Step (S3), a data synthesis framework that shrinks the distribution gap between synthetic and real data.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms [5.366354612549173]
We focus on data-synthesis methods to create high-quality synthetic data.
We present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data.
Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.
arXiv Detail & Related papers (2022-04-08T07:48:57Z)
- SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources [8.350531869939351]
We study a synthetic data generation task called downscaling.
We propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula).
We make four key contributions in this work.
arXiv Detail & Related papers (2020-09-20T16:36:25Z)
- Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)