Generating Electronic Health Records with Multiple Data Types and
Constraints
- URL: http://arxiv.org/abs/2003.07904v2
- Date: Mon, 23 Mar 2020 22:01:37 GMT
- Title: Generating Electronic Health Records with Multiple Data Types and
Constraints
- Authors: Chao Yan, Ziqi Zhang, Steve Nyemba, Bradley A. Malin
- Abstract summary: Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions.
Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks.
We introduce a method to simulate EHRs composed of multiple data types by 1) refining the GAN model, 2) accounting for feature constraints, and 3) incorporating key utility measures for such generation tasks.
- Score: 17.32526100692928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sharing electronic health records (EHRs) on a large scale may lead to privacy
intrusions. Recent research has shown that risks may be mitigated by simulating
EHRs through generative adversarial network (GAN) frameworks. Yet the methods
developed to date are limited because they 1) focus on generating data of a
single type (e.g., diagnosis codes), neglecting other data types (e.g.,
demographics, procedures or vital signs) and 2) do not represent constraints
between features. In this paper, we introduce a method to simulate EHRs
composed of multiple data types by 1) refining the GAN model, 2) accounting for
feature constraints, and 3) incorporating key utility measures for such
generation tasks. Our analysis with over $770,000$ EHRs from Vanderbilt
University Medical Center demonstrates that the new model achieves higher
performance in terms of retaining basic statistics, cross-feature correlations,
latent structural properties, feature constraints and associated patterns from
real data, without sacrificing privacy.
Related papers
- Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z) - Guided Discrete Diffusion for Electronic Health Record Generation [47.129056768385084]
EHRs are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research.
Despite wide usability, their sensitive nature raises privacy and confidentially concerns, which limit potential use cases.
To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs.
arXiv Detail & Related papers (2024-04-18T16:50:46Z) - Federated Causal Discovery from Heterogeneous Data [70.31070224690399]
We propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data.
These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy.
We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.
arXiv Detail & Related papers (2024-02-20T18:53:53Z) - Reliable Generation of Privacy-preserving Synthetic Electronic Health Record Time Series via Diffusion Models [4.240899165468488]
Electronic Health Records (EHRs) are rich sources of patient-level data, offering valuable resources for medical data analysis.
However, privacy concerns often restrict access to EHRs, hindering downstream analysis.
This study aims to overcome these challenges by generating realistic and privacy-preserving synthetic EHR time series efficiently.
arXiv Detail & Related papers (2023-10-23T18:56:01Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - Generating Synthetic Mixed-type Longitudinal Electronic Health Records
for Artificial Intelligent Applications [9.374416143268892]
generative adversarial network (GAN) entitled EHR-M-GAN which synthesizes textitmixed-type timeseries EHR data.
We have validated EHR-M-GAN on three publicly-available intensive care unit databases with records from a total of 141,488 unique patients.
arXiv Detail & Related papers (2021-12-22T17:17:34Z) - SANSformers: Self-Supervised Forecasting in Electronic Health Records
with Attention-Free Models [48.07469930813923]
This work aims to forecast the demand for healthcare services, by predicting the number of patient visits to healthcare facilities.
We introduce SANSformer, an attention-free sequential model designed with specific inductive biases to cater for the unique characteristics of EHR data.
Our results illuminate the promising potential of tailored attention-free models and self-supervised pretraining in refining healthcare utilization predictions across various patient demographics.
arXiv Detail & Related papers (2021-08-31T08:23:56Z) - Categorical EHR Imputation with Generative Adversarial Nets [11.171712535005357]
We propose a simple and yet effective approach that is based on previous work on GANs for data imputation.
We show that our imputation approach largely improves the prediction accuracy, compared to more traditional data imputation approaches.
arXiv Detail & Related papers (2021-08-03T18:50:26Z) - Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z) - Generation of Differentially Private Heterogeneous Electronic Health
Records [9.926231893220061]
We explore using Generative Adversarial Networks to generate synthetic, heterogeneous EHRs.
We will explore applying differential privacy (DP) preserving optimization in order to produce DP synthetic EHR data sets.
arXiv Detail & Related papers (2020-06-05T13:21:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.