Causal-TGAN: Generating Tabular Data Using Causal Generative Adversarial
Networks
- URL: http://arxiv.org/abs/2104.10680v1
- Date: Wed, 21 Apr 2021 17:59:41 GMT
- Title: Causal-TGAN: Generating Tabular Data Using Causal Generative Adversarial
Networks
- Authors: Bingyang Wen, Luis Oliveros Colon, K.P. Subbalakshmi and R.
Chandramouli
- Abstract summary: We propose a causal model named Causal Tabular Generative Neural Network (Causal-TGAN) to generate synthetic data.
Experiments on both simulated datasets and real datasets demonstrate the better performance of our method.
- Score: 7.232789848964222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic data generation has become prevalent as a solution to privacy leakage
and data shortage. Generative models are designed to produce a realistic
synthetic dataset that closely matches the data distribution of the
real dataset. Generative adversarial networks (GANs), which have achieved great
success in computer vision, are widely used for synthetic data
generation. Although prior work has demonstrated great progress,
most of it learns the correlations in the data distribution rather than the
true processes by which the datasets are naturally generated. Correlation is
unreliable because it is a statistical measure that captures only linear
dependencies and is easily affected by dataset bias. Causality, which
encodes the underlying factors of how the real data are naturally generated, is
more reliable than correlation. In this work, we propose a causal model named
Causal Tabular Generative Neural Network (Causal-TGAN) to generate synthetic
tabular data using the tabular data's causal information. Extensive experiments
on both simulated and real datasets show that our method performs better
when given the true causal graph and comparably
when using an estimated causal graph.
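The abstract's two central ideas, generating rows by following a causal graph and the limits of linear correlation, can be illustrated with a minimal sketch. This is not the paper's implementation: Causal-TGAN learns its generating mechanisms with neural networks trained adversarially, whereas the graph `PARENTS`, the hand-written `MECHANISMS`, and the helpers `sample_rows` and `pearson` below are hypothetical names chosen only for illustration. Each row is sampled in topological order so that every variable is drawn after its causal parents.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical causal graph: each variable maps to its direct causes.
PARENTS = {"x": [], "y": ["x"], "z": ["y"]}

# Hypothetical hand-written mechanisms (a learned generator would replace these).
MECHANISMS = {
    "x": lambda pa, rng: rng.uniform(-1.0, 1.0),                 # root cause
    "y": lambda pa, rng: pa["x"] ** 2 + rng.gauss(0.0, 0.01),    # nonlinear in x
    "z": lambda pa, rng: 2.0 * pa["y"] + rng.gauss(0.0, 0.01),   # linear in y
}


def sample_rows(n, seed=0):
    """Draw n rows, sampling each variable after all of its parents."""
    rng = random.Random(seed)
    order = list(TopologicalSorter(PARENTS).static_order())
    rows = []
    for _ in range(n):
        row = {}
        for node in order:
            parent_values = {p: row[p] for p in PARENTS[node]}
            row[node] = MECHANISMS[node](parent_values, rng)
        rows.append(row)
    return rows


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


rows = sample_rows(20000)
xs = [r["x"] for r in rows]
ys = [r["y"] for r in rows]
zs = [r["z"] for r in rows]
r_xy = pearson(xs, ys)  # y depends on x, but not linearly
r_yz = pearson(ys, zs)  # z depends on y linearly
print(f"corr(x, y) = {r_xy:.3f}, corr(y, z) = {r_yz:.3f}")
```

In this toy setup `y` is fully determined by `x` up to small noise, yet their Pearson correlation is close to zero because the dependence is symmetric and nonlinear, while the linear link between `y` and `z` gives a correlation close to one.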
Related papers
- Marginal Causal Flows for Validation and Inference [3.547529079746247]
Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging.
We introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process.
We demonstrate the above with experiments on both simulated and real-world datasets.
(2024-11-02T16:04:57Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
(2024-10-29T04:14:32Z)
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
(2024-06-25T10:49:56Z)
- Fake It Till Make It: Federated Learning with Consensus-Oriented Generation [52.82176415223988]
We propose federated learning with consensus-oriented generation (FedCOG).
FedCOG consists of two key components at the client side: complementary data generation and knowledge-distillation-based model training.
Experiments on classical and real-world FL datasets show that FedCOG consistently outperforms state-of-the-art methods.
(2023-12-10T18:49:59Z)
- CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular Data Synthesis [0.4999814847776097]
Generative adversarial networks (GANs) have drawn considerable attention in recent years for their proven capability in generating synthetic data.
The validity of the synthetic data and the underlying privacy concerns represent major challenges which are not sufficiently addressed.
(2023-07-01T16:52:18Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
(2023-05-16T07:30:29Z)
- Boosting Synthetic Data Generation with Effective Nonlinear Causal Discovery [11.81479419498206]
In software testing, data privacy, imbalanced learning, and artificial intelligence explanation, it is crucial to generate plausible data samples.
A common assumption of approaches widely used for data generation is the independence of the features.
We propose a synthetic dataset generator that can discover nonlinear causalities among the variables and use them at generation time.
(2023-01-18T10:54:06Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
(2022-11-30T10:40:44Z)
- Federated Causal Discovery [74.37739054932733]
This paper develops a gradient-based learning framework named DAG-Shared Federated Causal Discovery (DS-FCD).
It can learn the causal graph without directly touching local data and naturally handle the data heterogeneity.
Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
(2021-12-07T08:04:12Z)
- Copula Flows for Synthetic Data Generation [0.5801044612920815]
We propose to use a probabilistic model as a synthetic data generator.
We benchmark our method on both simulated and real data-sets in terms of density estimation.
(2021-01-03T10:06:23Z)