DATGAN: Integrating expert knowledge into deep learning for synthetic
tabular data
- URL: http://arxiv.org/abs/2203.03489v1
- Date: Mon, 7 Mar 2022 16:09:03 GMT
- Title: DATGAN: Integrating expert knowledge into deep learning for synthetic
tabular data
- Authors: Gael Lederrey, Tim Hillel, Michel Bierlaire
- Abstract summary: Synthetic data can be used in various applications, such as correcting biased datasets or replacing scarce original data for simulation purposes.
Deep learning models are data-driven and it is difficult to control the generation process.
This article presents the Directed Acyclic Tabular GAN (DATGAN) to address these limitations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data can be used in various applications, such as correcting biased
datasets or replacing scarce original data for simulation purposes. Generative
Adversarial Networks (GANs) are considered state-of-the-art for developing
generative models. However, these deep learning models are data-driven, and it
is, thus, difficult to control the generation process. It can, therefore, lead
to the following issues: lack of representativity in the generated data, the
introduction of bias, and the possibility of overfitting the sample's noise.
This article presents the Directed Acyclic Tabular GAN (DATGAN) to address
these limitations by integrating expert knowledge in deep learning models for
synthetic tabular data generation. This approach allows the interactions
between variables to be specified explicitly using a Directed Acyclic Graph
(DAG). The DAG is then converted to a network of modified Long Short-Term
Memory (LSTM) cells to accept multiple inputs. Multiple DATGAN versions are
systematically tested on multiple assessment metrics. We show that the best
versions of the DATGAN outperform state-of-the-art generative models on
multiple case studies. Finally, we show how the DAG can create hypothetical
synthetic datasets.
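The abstract describes specifying variable interactions as a DAG, which is then turned into a network of cells that can take several inputs. A minimal sketch of the first step only is given below, using Python's standard library. The variable names and edges are hypothetical and are not taken from the paper, and the function `generation_order` is an illustrative helper, not part of any DATGAN implementation. It checks that the expert-specified graph is acyclic and derives an order in which each variable comes after all of its parents, which is the order a generator would need in order to feed parent outputs into child cells.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical expert knowledge, written as (parent, child) edges.
# These variables are illustrative and do not come from the paper.
edges = [
    ("age", "driving_license"),
    ("age", "income"),
    ("income", "car_ownership"),
    ("driving_license", "car_ownership"),
    ("car_ownership", "travel_mode"),
]


def generation_order(edges):
    """Return (order, parents): an order in which every parent precedes
    its children, and a mapping from each variable to its set of parents."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(src, set())
        parents.setdefault(dst, set()).add(src)
    try:
        # TopologicalSorter expects a mapping node -> predecessors.
        order = list(TopologicalSorter(parents).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes of the detected cycle.
        raise ValueError(f"graph is not acyclic: {err.args[1]}") from err
    return order, parents


order, parents = generation_order(edges)
for variable in order:
    # A variable with several parents is the case where a cell would
    # need to accept multiple inputs.
    print(variable, "<-", sorted(parents[variable]))
```

In this sketch, `car_ownership` has two parents, which corresponds to the situation the abstract mentions where a cell must accept multiple inputs. Adding an edge such as `("travel_mode", "age")` would create a cycle and make the helper raise `ValueError`.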
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Differentially Private Tabular Data Synthesis using Large Language Models [6.6376578496141585]
This paper introduces DP-LLMTGen -- a novel framework for differentially private tabular data synthesis.
DP-LLMTGen models sensitive datasets using a two-stage fine-tuning procedure.
It generates synthetic data through sampling the fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-03T15:43:57Z)
- An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z)
- CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular Data Synthesis [0.4999814847776097]
Generative adversarial networks (GANs) have drawn considerable attention in recent years for their proven capability in generating synthetic data.
The validity of the synthetic data and the underlying privacy concerns represent major challenges which are not sufficiently addressed.
arXiv Detail & Related papers (2023-07-01T16:52:18Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Language Models are Realistic Tabular Data Generators [15.851912974874116]
We propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative large language model (LLM) to sample synthetic and yet highly realistic data.
We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles.
arXiv Detail & Related papers (2022-10-12T15:03:28Z)
- TTS-CGAN: A Transformer Time-Series Conditional GAN for Biosignal Data Augmentation [5.607676459156789]
We present TTS-CGAN, a conditional GAN model that can be trained on existing multi-class datasets and generate class-specific synthetic time-series sequences.
Synthetic sequences generated by our model are indistinguishable from real ones, and can be used to complement or replace real signals of the same type.
arXiv Detail & Related papers (2022-06-28T01:01:34Z)
- DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks [71.6879432974126]
We introduce DECAF: a GAN-based fair synthetic data generator for tabular data.
We show that DECAF successfully removes undesired bias and is capable of generating high-quality synthetic data.
We provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
arXiv Detail & Related papers (2021-10-25T12:39:56Z)
- Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)
- Recent Developments Combining Ensemble Smoother and Deep Generative Networks for Facies History Matching [58.720142291102135]
This research project focuses on the use of autoencoder networks to construct a continuous parameterization for facies models.
We benchmark seven different formulations, including VAE, generative adversarial network (GAN), Wasserstein GAN, variational auto-encoding GAN, principal component analysis (PCA) with cycle GAN, PCA with transfer style network and VAE with style loss.
arXiv Detail & Related papers (2020-05-08T21:32:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.