CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular
Data Synthesis
- URL: http://arxiv.org/abs/2307.00384v2
- Date: Mon, 22 Jan 2024 22:12:02 GMT
- Title: CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular
Data Synthesis
- Authors: Abdallah Alshantti, Damiano Varagnolo, Adil Rasheed, Aria Rahmati and
Frank Westad
- Abstract summary: Generative adversarial networks (GANs) have drawn considerable attention in recent years for their proven capability in generating synthetic data.
The validity of the synthetic data and the underlying privacy concerns represent major challenges which are not sufficiently addressed.
- Score: 0.4999814847776097
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generative adversarial networks (GANs) have drawn considerable attention in
recent years for their proven capability in generating synthetic data which can
be utilised for multiple purposes. While GANs have demonstrated tremendous
successes in producing synthetic data samples that replicate the dynamics of
the original datasets, the validity of the synthetic data and the underlying
privacy concerns represent major challenges which are not sufficiently
addressed. In this work, we design a cascaded tabular GAN framework (CasTGAN)
for generating realistic tabular data with a specific focus on the validity of
the output. In this context, validity refers to the dependencies between
features that exist in the real data but are typically misrepresented by
traditional generative models. Our key idea is that by employing a cascaded
architecture, in which a dedicated generator samples each feature, the synthetic
output becomes more representative of the real data. Our experimental results
demonstrate that our model is capable of generating synthetic tabular data that
can be used for fitting machine learning models. In addition, our model
captures well the constraints and correlations between the features of the
real data, especially for high-dimensional datasets. Furthermore, we evaluate
the risk of white-box privacy attacks on our model and subsequently show that
applying some perturbations to the auxiliary learners in CasTGAN increases the
overall robustness of our model against targeted attacks.
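The cascaded design described in the abstract can be pictured as a chain of per-feature generators, where each generator receives the noise vector together with the columns synthesised so far. The PyTorch sketch below is a minimal illustration of that idea only; the class names, layer sizes, and single-column outputs are assumptions made for this example, and it omits the per-feature discriminators and auxiliary learners that the paper also describes.

```python
# Illustrative sketch (not the authors' implementation) of a cascaded generator:
# one dedicated generator per feature, each conditioned on the noise vector and
# on the features generated earlier in the cascade. All names and dimensions
# here are assumptions for illustration.
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Generates a single feature column from noise + previously generated features."""
    def __init__(self, noise_dim: int, context_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + context_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one output column per generator
        )

    def forward(self, noise, context):
        return self.net(torch.cat([noise, context], dim=1))

class CascadedGenerator(nn.Module):
    """Chains one FeatureGenerator per column; feature i sees features 0..i-1."""
    def __init__(self, noise_dim: int, n_features: int):
        super().__init__()
        self.generators = nn.ModuleList(
            FeatureGenerator(noise_dim, context_dim=i) for i in range(n_features)
        )

    def forward(self, noise):
        features = []
        for gen in self.generators:
            # context is empty for the first feature, then grows along the cascade
            context = (torch.cat(features, dim=1) if features
                       else noise.new_zeros(noise.size(0), 0))
            features.append(gen(noise, context))
        return torch.cat(features, dim=1)  # synthetic tabular batch

# usage: sample a batch of 8 synthetic rows with 5 numerical features
z = torch.randn(8, 16)
fake_rows = CascadedGenerator(noise_dim=16, n_features=5)(z)
```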
Related papers
- Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic.
Our approach transforms numerical data into text, re-framing data generation as a language modeling task.
Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
- zGAN: An Outlier-focused Generative Adversarial Network For Realistic Synthetic Data Generation [0.0]
"Black swans" have posed a challenge to the performance of classical machine learning models.
This article provides an overview of the zGAN model architecture developed for the purpose of generating synthetic data with outlier characteristics.
It shows promising results on realistic synthetic data generation, as well as uplift capabilities vis-a-vis model performance.
arXiv Detail & Related papers (2024-10-28T07:55:11Z)
- MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data [10.217822818544475]
We propose a framework to generate synthetic (tabular) data powered by large language models (LLMs).
Our approach significantly enhances the quality of synthetic data generation in common scenarios with small sample sizes.
Our results demonstrate that our model outperforms several state-of-the-art models in generating higher-quality synthetic data for downstream tasks while preserving the privacy of the real data.
arXiv Detail & Related papers (2024-06-15T06:26:17Z)
- An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Generative Modeling for Tabular Data via Penalized Optimal Transport Network [2.0319002824093015]
Wasserstein generative adversarial network (WGAN) is a notable improvement in generative modeling.
We propose POTNet, a generative deep neural network based on a novel, robust, and interpretable marginally-penalized Wasserstein (MPW) loss.
arXiv Detail & Related papers (2024-02-16T05:27:05Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
- Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)