Generation of Differentially Private Heterogeneous Electronic Health
Records
- URL: http://arxiv.org/abs/2006.03423v1
- Date: Fri, 5 Jun 2020 13:21:46 GMT
- Authors: Kieran Chin-Cheong, Thomas Sutter and Julia E. Vogt
- Abstract summary: We explore using Generative Adversarial Networks to generate synthetic, heterogeneous EHRs.
We will explore applying differential privacy (DP) preserving optimization in order to produce DP synthetic EHR data sets.
- Score: 9.926231893220061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Electronic Health Records (EHRs) are commonly used by the machine learning
community for research on problems specifically related to health care and
medicine. EHRs have the advantages that they can be easily distributed and
contain many features useful for e.g. classification problems. What makes EHR
data sets different from typical machine learning data sets is that they are
often very sparse, due to their high dimensionality, and often contain
heterogeneous (mixed) data types. Furthermore, the data sets deal with
sensitive information, which limits the distribution of any models learned
using them, due to privacy concerns. For these reasons, using EHR data in
practice presents a real challenge. In this work, we explore using Generative
Adversarial Networks to generate synthetic, heterogeneous EHRs with the goal of
using these synthetic records in place of existing data sets for downstream
classification tasks. We will further explore applying differential privacy
(DP) preserving optimization in order to produce DP synthetic EHR data sets,
which provide rigorous privacy guarantees, and are therefore shareable and
usable in the real world. The performance (measured by AUROC, AUPRC and
accuracy) of our model's synthetic, heterogeneous data is very close to the
original data set (within 3-5% of the baseline) for the non-DP model when
tested in a binary classification task. Using strong $(\epsilon, \delta) = (1, 10^{-5})$ DP, our
model still produces data useful for machine learning tasks, albeit incurring a
roughly 17% performance penalty in our tested classification task. We
additionally perform a sub-population analysis and find that our model does not
introduce any bias into the synthetic EHR data compared to the baseline in
either male/female populations, or the 0-18, 19-50 and 51+ age groups in terms
of classification performance for either the non-DP or DP variant.
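
The abstract refers to "DP preserving optimization" without naming the optimizer. One widely used form is DP-SGD, in which each example's gradient is clipped to a fixed L2 norm and Gaussian noise scaled to that bound is added before the update. The sketch below illustrates that general idea only; it is an assumption that the paper uses something of this kind, and the function name `dp_sgd_step` and its parameters (`clip_norm`, `noise_multiplier`) are hypothetical names chosen for illustration, not taken from the paper.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD-style update (illustrative sketch, not the paper's code).

    params:            array of shape (d,)
    per_example_grads: array of shape (batch, d), one gradient per record
    """
    batch_size = per_example_grads.shape[0]
    # L2 norm of each example's gradient, shape (batch,)
    norms = np.linalg.norm(per_example_grads, axis=1)
    # Shrink any gradient whose norm exceeds clip_norm; leave the rest unchanged
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale[:, None]
    # Gaussian noise with standard deviation proportional to the clipping bound
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    # Average the noisy sum and take a gradient step
    noisy_mean = (clipped.sum(axis=0) + noise) / batch_size
    return params - lr * noisy_mean

# Example call with made-up gradients for two records
rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])
new_params = dp_sgd_step(np.zeros(2), grads, lr=0.1,
                         clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

The sketch does not compute a privacy budget. Relating a noise level and number of training steps to a target such as $(1, 10^{-5})$ requires a separate privacy accounting step, which is omitted here.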
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z)
- An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Synthesizing Mixed-type Electronic Health Records using Diffusion Models [10.973115905786129]
Synthetic data generation is a promising solution to mitigate privacy concerns when sharing sensitive patient information.
Recent studies have shown that diffusion models offer several advantages over GANs, such as generating more realistic synthetic data and stable training, across data modalities including image, text, and sound.
Our experiments demonstrate that TabDDPM outperforms the state-of-the-art models across all evaluation metrics, except for privacy, which confirms the trade-off between privacy and utility.
arXiv Detail & Related papers (2023-02-28T15:42:30Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Categorical EHR Imputation with Generative Adversarial Nets [11.171712535005357]
We propose a simple and yet effective approach that is based on previous work on GANs for data imputation.
We show that our imputation approach largely improves the prediction accuracy, compared to more traditional data imputation approaches.
arXiv Detail & Related papers (2021-08-03T18:50:26Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Generating Electronic Health Records with Multiple Data Types and Constraints [17.32526100692928]
Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions.
Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks.
We introduce a method to simulate EHRs composed of multiple data types by 1) refining the GAN model, 2) accounting for feature constraints, and 3) incorporating key utility measures for such generation tasks.
arXiv Detail & Related papers (2020-03-17T19:25:16Z)
- DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction [67.91606509226132]
Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment.
DeepEnroll is a cross-modal inference learning model that jointly encodes enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference.
arXiv Detail & Related papers (2020-01-22T17:51:25Z)