Distributed Conditional GAN (discGAN) For Synthetic Healthcare Data Generation
- URL: http://arxiv.org/abs/2304.04290v1
- Date: Sun, 9 Apr 2023 18:35:05 GMT
- Title: Distributed Conditional GAN (discGAN) For Synthetic Healthcare Data Generation
- Authors: David Fuentes, Diana McSpadden and Sodiq Adewole
- Abstract summary: We propose distributed Generative Adversarial Networks (discGANs) to generate synthetic data specific to the healthcare domain.
We generated 249,000 synthetic records from the original 2,027-record eICU dataset.
Our results show that discGAN was able to generate data with distributions similar to the real data.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this paper, we propose distributed Generative Adversarial Networks
(discGANs) to generate synthetic tabular data specific to the healthcare
domain. While using GANs to generate images has been well studied, little to no
attention has been given to the generation of tabular data. Modeling distributions
of discrete and continuous tabular data is a non-trivial task with high
utility. We applied discGAN to model non-Gaussian multi-modal healthcare data.
We generated 249,000 synthetic records from the original 2,027-record eICU dataset. We
evaluated the performance of the model using machine learning efficacy, the
Kolmogorov-Smirnov (KS) test for continuous variables, and the chi-squared test for
discrete variables. Our results show that discGAN was able to generate data
with distributions similar to the real data.
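The two statistical checks named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `ks_statistic` and `chi_squared_statistic` and the toy columns are hypothetical, and only the raw test statistics are computed, using the Python standard library. For p-values, library routines such as `scipy.stats.ks_2samp` and `scipy.stats.chisquare` would be the usual choice.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    n, m = len(real_sorted), len(synth_sorted)
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value seen in either sample.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def chi_squared_statistic(real, synth):
    """Pearson chi-squared statistic comparing synthetic category counts
    with the counts expected from the real category proportions."""
    real_counts, synth_counts = Counter(real), Counter(synth)
    n_real, n_synth = len(real), len(synth)
    stat = 0.0
    # Assumption of this sketch: categories that appear only in the
    # synthetic column are ignored, since their expected count is zero.
    for category, count in real_counts.items():
        expected = count / n_real * n_synth
        observed = synth_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Toy columns (hypothetical values, not taken from the eICU dataset).
real_age = [34, 45, 51, 62, 70]
synth_age = [33, 47, 50, 65, 71]
real_unit = ["ICU", "ICU", "ward", "ward"]
synth_unit = ["ICU", "ICU", "ICU", "ward"]

print("KS statistic:", ks_statistic(real_age, synth_age))
print("Chi-squared statistic:", chi_squared_statistic(real_unit, synth_unit))
```

In both cases a smaller statistic means the synthetic column tracks the real column more closely. A KS statistic of 0 means the two empirical distributions are identical, and 1 means the samples do not overlap at all.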
Related papers
- CtrTab: Tabular Data Synthesis with High-Dimensional and Limited Data [16.166752861658953]
When the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models.
This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately.
We propose CtrTab to improve the performance of diffusion-based generative models in high-dimensional, low-data scenarios.
arXiv Detail & Related papers (2025-03-09T05:01:56Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z)
- Synthetic location trajectory generation using categorical diffusion models [50.809683239937584]
Diffusion models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data.
We propose using DPMs for the generation of synthetic individual location trajectories (ILTs) which are sequences of variables representing physical locations visited by individuals.
arXiv Detail & Related papers (2024-02-19T15:57:39Z)
- Synthesizing Mixed-type Electronic Health Records using Diffusion Models [10.973115905786129]
Synthetic data generation is a promising solution to mitigate privacy concerns when sharing sensitive patient information.
Recent studies have shown that diffusion models offer several advantages over GANs, such as generation of more realistic synthetic data and stable training in generating data modalities, including image, text, and sound.
Our experiments demonstrate that TabDDPM outperforms the state-of-the-art models across all evaluation metrics, except for privacy, which confirms the trade-off between privacy and utility.
arXiv Detail & Related papers (2023-02-28T15:42:30Z)
- Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
- Language Models are Realistic Tabular Data Generators [15.851912974874116]
We propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative large language model (LLM) to sample synthetic yet highly realistic data.
We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles.
arXiv Detail & Related papers (2022-10-12T15:03:28Z)
- Improving Correlation Capture in Generating Imbalanced Data using Differentially Private Conditional GANs [2.2265840715792735]
We propose DP-CGANS, a differentially private conditional GAN framework consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving data.
We extensively evaluate our model with state-of-the-art generative models on three public datasets and two real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement.
arXiv Detail & Related papers (2022-06-28T06:47:27Z)
- Generative Adversarial Networks for Synthetic Data Generation: A Comparative Study [1.0896567381206714]
Generative Adversarial Networks (GANs) are gaining increasing attention as a means for synthesising data.
Here we consider the potential application of GANs for the purpose of generating synthetic census microdata.
arXiv Detail & Related papers (2021-12-03T14:23:17Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- GANs with Conditional Independence Graphs: On Subadditivity of Probability Divergences [70.30467057209405]
Generative Adversarial Networks (GANs) are modern methods to learn the underlying distribution of a data set.
GANs are designed in a model-free fashion where no additional information about the underlying distribution is available.
We propose a principled design of a model-based GAN that uses a set of simple discriminators on the neighborhoods of the Bayes-net/MRF.
arXiv Detail & Related papers (2020-03-02T04:31:22Z)
- Distribution Approximation and Statistical Estimation Guarantees of Generative Adversarial Networks [82.61546580149427]
Generative Adversarial Networks (GANs) have achieved great success in unsupervised learning.
This paper provides approximation and statistical guarantees of GANs for the estimation of data distributions with densities in a Hölder space.
arXiv Detail & Related papers (2020-02-10T16:47:57Z)