CTAB-GAN: Effective Table Data Synthesizing
- URL: http://arxiv.org/abs/2102.08369v1
- Date: Tue, 16 Feb 2021 18:53:57 GMT
- Title: CTAB-GAN: Effective Table Data Synthesizing
- Authors: Zilong Zhao, Aditya Kunar, Hiek Van der Scheer, Robert Birke and Lydia
Y. Chen
- Abstract summary: We develop CTAB-GAN, a conditional table GAN architecture that can model diverse data types.
We show that CTAB-GAN remarkably resembles the real data for all three types of variables and results into higher accuracy for five machine learning algorithms, by up 17%.
- Score: 7.336728307626645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While data sharing is crucial for knowledge development, privacy concerns and
strict regulation (e.g., European General Data Protection Regulation (GDPR))
unfortunately limit its full effectiveness. Synthetic tabular data emerges as
an alternative to enable data sharing while fulfilling regulatory and privacy
constraints. The state-of-the-art tabular data synthesizers draw methodologies
from generative Adversarial Networks (GAN) and address two main data types in
the industry, i.e., continuous and categorical. In this paper, we develop
CTAB-GAN, a novel conditional table GAN architecture that can effectively model
diverse data types, including a mix of continuous and categorical variables.
Moreover, we address data imbalance and long-tail issues, i.e., certain
variables have drastic frequency differences across large values. To achieve
those aims, we first introduce the information loss and classification loss to
the conditional GAN. Secondly, we design a novel conditional vector, which
efficiently encodes the mixed data type and skewed distribution of data
variable. We extensively evaluate CTAB-GAN with the state of the art GANs that
generate synthetic tables, in terms of data similarity and analysis utility.
The results on five datasets show that the synthetic data of CTAB-GAN
remarkably resembles the real data for all three types of variables and results
into higher accuracy for five machine learning algorithms, by up to 17%.
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to $22.5%$ improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z) - An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Generating Realistic Synthetic Relational Data through Graph Variational
Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z) - Improving Correlation Capture in Generating Imbalanced Data using
Differentially Private Conditional GANs [2.2265840715792735]
We propose DP-CGANS, a differentially private conditional GAN framework consisting of data transformation, sampling, conditioning, and networks training to generate realistic and privacy-preserving data.
We extensively evaluate our model with state-of-the-art generative models on three public datasets and two real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement.
arXiv Detail & Related papers (2022-06-28T06:47:27Z) - CTAB-GAN+: Enhancing Tabular Data Synthesis [11.813626861559904]
CTAB-GAN+ improves upon state-of-the-art GANs by adding downstream losses to conditional GANs for higher utility synthetic data domains.
We show CTAB-GAN+ synthesizes privacy-preserving data with at least 48.16% higher utility across multiple datasets and learning tasks under different privacy budgets.
arXiv Detail & Related papers (2022-04-01T12:52:30Z) - Federated Causal Discovery [74.37739054932733]
This paper develops a gradient-based learning framework named DAG-Shared Federated Causal Discovery (DS-FCD)
It can learn the causal graph without directly touching local data and naturally handle the data heterogeneity.
Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
arXiv Detail & Related papers (2021-12-07T08:04:12Z) - Effective and Privacy preserving Tabular Data Synthesizing [0.0]
We develop novel conditional table GAN architecture that can model diverse data types with complex distributions.
We train CTAB-GAN with strict privacy guarantees to ensure greater security for training GANs against malicious privacy attacks.
arXiv Detail & Related papers (2021-08-11T13:55:48Z) - CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for
Natural Language Understanding [67.61357003974153]
We propose a novel data augmentation framework dubbed CoDA.
CoDA synthesizes diverse and informative augmented examples by integrating multiple transformations organically.
A contrastive regularization objective is introduced to capture the global relationship among all the data samples.
arXiv Detail & Related papers (2020-10-16T23:57:03Z) - Synthesizing Property & Casualty Ratemaking Datasets using Generative
Adversarial Networks [2.2649197740853677]
We show how to design three types of generative adversarial networks (GANs) that can build a synthetic insurance dataset from a confidential original dataset.
For transparency, the approaches are illustrated using a public dataset, the French motor third party liability data.
We find that the MC-WGAN-GP synthesizes the best data, the CTGAN is the easiest to use, and the MNCDP-GAN guarantees differential privacy.
arXiv Detail & Related papers (2020-08-13T21:02:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.