Related papers: EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

URL: http://arxiv.org/abs/2012.10020v1
Date: Fri, 18 Dec 2020 02:37:49 GMT
Title: EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders
Authors: Siddharth Biswal, Soumya Ghosh, Jon Duke, Bradley Malin, Walter Stewart and Jimeng Sun
Abstract summary: We propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters and encounter features. We illustrate that EVA can produce realistic sequences, account for individual differences among patients, and can be conditioned on specific disease conditions. We assess the utility of the methods on large real-world EHR repositories containing over 250, 000 patients.
Score: 34.22731849545798
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems value deeply patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are susceptible to re-identification and its volume is also limited. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and can be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250, 000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate the generated EHR sequences are realistic. We confirmed the performance of predictive models trained on the synthetic data are similar with those trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance - improving the best baseline by as much as 8% in top-20 recall.

Related papers

High-Fidelity Synthetic ECG Generation via Mel-Spectrogram Informed Diffusion Training [3.864395218585964]
Development of machine learning for cardiac care is hampered by privacy restrictions on sharing real patient electrocardiogram (ECG) data.<n>In this work, we address two major shortcomings of current generative ECG methods.<n>We build on a conditional diffusion-based Structured State Space Model (SSSD-ECG) with two principled innovations.
arXiv Detail & Related papers (2025-10-07T01:14:53Z)
IGNITE: Individualized GeNeration of Imputations in Time-series Electronic health records [7.451873794596469]
We propose a novel deep-learning model that learns the underlying patient dynamics to generate personalized values conditioning on an individual's demographic characteristics and treatments. Our proposed model, IGNITE, utilise a conditional dual-variational autoencoder augmented with dual-stage attention to generate missing values for an individual. We show that IGNITE outperforms state-of-the-art approaches in missing data reconstruction and task prediction.
arXiv Detail & Related papers (2024-01-09T07:57:21Z)
Reliable Generation of Privacy-preserving Synthetic Electronic Health Record Time Series via Diffusion Models [4.240899165468488]
Electronic Health Records (EHRs) are rich sources of patient-level data, offering valuable resources for medical data analysis. However, privacy concerns often restrict access to EHRs, hindering downstream analysis. This study aims to overcome these challenges by generating realistic and privacy-preserving synthetic EHR time series efficiently.
arXiv Detail & Related papers (2023-10-23T18:56:01Z)
MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion. It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space. It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment. In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials. We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
Synthesize High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model [40.473866438962034]
Synthetic electronic health records can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. We propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal high-dimensional EHR.
arXiv Detail & Related papers (2023-04-04T23:53:34Z)
Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM) Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)
Integrated Convolutional and Recurrent Neural Networks for Health Risk Prediction using Patient Journey Data with Many Missing Values [9.418011774179794]
This paper proposes a novel end-to-end approach to modeling EHR patient journey data with Integrated Convolutional and Recurrent Neural Networks. Our model can capture both long- and short-term temporal patterns within each patient journey and effectively handle the high degree of missingness in EHR data without any imputation data generation.
arXiv Detail & Related papers (2022-11-11T07:36:18Z)
Generating Synthetic Mixed-type Longitudinal Electronic Health Records for Artificial Intelligent Applications [9.374416143268892]
generative adversarial network (GAN) entitled EHR-M-GAN which synthesizes textitmixed-type timeseries EHR data. We have validated EHR-M-GAN on three publicly-available intensive care unit databases with records from a total of 141,488 unique patients.
arXiv Detail & Related papers (2021-12-22T17:17:34Z)
SANSformers: Self-Supervised Forecasting in Electronic Health Records with Attention-Free Models [48.07469930813923]
This work aims to forecast the demand for healthcare services, by predicting the number of patient visits to healthcare facilities. We introduce SANSformer, an attention-free sequential model designed with specific inductive biases to cater for the unique characteristics of EHR data. Our results illuminate the promising potential of tailored attention-free models and self-supervised pretraining in refining healthcare utilization predictions across various patient demographics.
arXiv Detail & Related papers (2021-08-31T08:23:56Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
Hemogram Data as a Tool for Decision-making in COVID-19 Management: Applications to Resource Scarcity Scenarios [62.997667081978825]
COVID-19 pandemics has challenged emergency response systems worldwide, with widespread reports of essential services breakdown and collapse of health care structure. This work describes a machine learning model derived from hemogram exam data performed in symptomatic patients. Proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity and specificity.
arXiv Detail & Related papers (2020-05-10T01:45:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.