Synthesize High-dimensional Longitudinal Electronic Health Records via
Hierarchical Autoregressive Language Model
- URL: http://arxiv.org/abs/2304.02169v3
- Date: Thu, 9 Nov 2023 21:59:41 GMT
- Title: Synthesize High-dimensional Longitudinal Electronic Health Records via
Hierarchical Autoregressive Language Model
- Authors: Brandon Theodorou, Cao Xiao, and Jimeng Sun
- Abstract summary: Synthetic electronic health records can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis.
We propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal high-dimensional EHR.
- Score: 40.473866438962034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic electronic health records (EHRs) that are both realistic and
preserve privacy can serve as an alternative to real EHRs for machine learning
(ML) modeling and statistical analysis. However, generating high-fidelity and
granular electronic health record (EHR) data in its original,
highly-dimensional form poses challenges for existing methods due to the
complexities inherent in high-dimensional data. In this paper, we propose
Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal
high-dimensional EHR, which preserve the statistical properties of real EHR and
can be used to train accurate ML models without privacy concerns. Our HALO
method, designed as a hierarchical autoregressive model, generates a
probability density function of medical codes, clinical visits, and patient
records, allowing for the generation of realistic EHR data in its original,
unaggregated form without the need for variable selection or aggregation.
Additionally, our model also produces high-quality continuous variables in a
longitudinal and probabilistic manner. We conducted extensive experiments and
demonstrate that HALO can generate high-fidelity EHR data with high-dimensional
disease code probabilities (d > 10,000), disease co-occurrence probabilities
within visits (d > 1,000,000), and conditional probabilities across consecutive
visits (d > 5,000,000) and achieve above 0.9 R2 correlation in comparison to
real EHR data. This performance then enables downstream ML models trained on
its synthetic data to achieve comparable accuracy to models trained on real
data (0.938 AUROC with HALO data vs. 0.943 with real data). Finally, using a
combination of real and synthetic data enhances the accuracy of ML models
beyond that achieved by using only real EHR data.
Related papers
- Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z) - Synthetic location trajectory generation using categorical diffusion
models [50.809683239937584]
Diffusion models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data.
We propose using DPMs for the generation of synthetic individual location trajectories (ILTs) which are sequences of variables representing physical locations visited by individuals.
arXiv Detail & Related papers (2024-02-19T15:57:39Z) - IGNITE: Individualized GeNeration of Imputations in Time-series
Electronic health records [7.451873794596469]
We propose a novel deep-learning model that learns the underlying patient dynamics to generate personalized values conditioning on an individual's demographic characteristics and treatments.
Our proposed model, IGNITE, utilise a conditional dual-variational autoencoder augmented with dual-stage attention to generate missing values for an individual.
We show that IGNITE outperforms state-of-the-art approaches in missing data reconstruction and task prediction.
arXiv Detail & Related papers (2024-01-09T07:57:21Z) - How Good Are Synthetic Medical Images? An Empirical Study with Lung
Ultrasound [0.3312417881789094]
Adding synthetic training data using generative models offers a low-cost method to deal with the data scarcity challenge.
We show that training with both synthetic and real data outperforms training with real data alone.
arXiv Detail & Related papers (2023-10-05T15:42:53Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Can segmentation models be trained with fully synthetically generated
data? [0.39577682622066246]
BrainSPADE is a model which combines a synthetic diffusion-based label generator with a semantic image generator.
Our model can produce fully synthetic brain labels on-demand, with or without pathology of interest, and then generate a corresponding MRI image of an arbitrary guided style.
Experiments show that brainSPADE synthetic data can be used to train segmentation models with performance comparable to that of models trained on real data.
arXiv Detail & Related papers (2022-09-17T05:24:04Z) - Generating Synthetic Mixed-type Longitudinal Electronic Health Records
for Artificial Intelligent Applications [9.374416143268892]
generative adversarial network (GAN) entitled EHR-M-GAN which synthesizes textitmixed-type timeseries EHR data.
We have validated EHR-M-GAN on three publicly-available intensive care unit databases with records from a total of 141,488 unique patients.
arXiv Detail & Related papers (2021-12-22T17:17:34Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - EVA: Generating Longitudinal Electronic Health Records Using Conditional
Variational Autoencoders [34.22731849545798]
We propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters and encounter features.
We illustrate that EVA can produce realistic sequences, account for individual differences among patients, and can be conditioned on specific disease conditions.
We assess the utility of the methods on large real-world EHR repositories containing over 250, 000 patients.
arXiv Detail & Related papers (2020-12-18T02:37:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.