Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data
- URL: http://arxiv.org/abs/2104.02932v1
- Date: Wed, 7 Apr 2021 06:02:04 GMT
- Title: Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data
- Authors: Tingyi Wanyan, Jing Zhang, Ying Ding, Ariful Azad, Zhangyang Wang,
Benjamin S Glicksberg
- Abstract summary: This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
- Score: 62.29031007761901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Electronic Health Record (EHR) data has been of tremendous utility in
Artificial Intelligence (AI) for healthcare such as predicting future clinical
events. These tasks, however, often come with many challenges when using
classical machine learning models due to a myriad of factors including class
imbalance and data heterogeneity (i.e., the complex intra-class variances). To
address some of these research gaps, this paper leverages the exciting
contrastive learning framework and proposes a novel contrastive regularized
clinical classification model. The contrastive loss is found to substantially
augment EHR-based prediction: it effectively characterizes the
similar/dissimilar patterns (by its "push-and-pull" form), meanwhile mitigating
the highly skewed class distribution by learning more balanced feature spaces
(as also echoed by recent findings). In particular, when naively exporting the
contrastive learning to the EHR data, one hurdle is in generating positive
samples, since EHR data is not as amenable to data augmentation as image data.
To this end, we have introduced two unique positive sampling strategies
specifically tailored for EHR data: a feature-based positive sampling that
exploits the feature space neighborhood structure to reinforce the feature
learning; and an attribute-based positive sampling that incorporates
pre-generated patient similarity metrics to define the sample proximity. Both
sampling approaches are designed with an awareness of unique high intra-class
variance in EHR data. Our overall framework yields highly competitive
experimental results in predicting the mortality risk on real-world COVID-19
EHR data with a total of 5,712 patients admitted to a large, urban health
system. Specifically, our method reaches a high AUROC prediction score of
0.959, which outperforms other baselines and alternatives: cross-entropy (0.873)
and focal loss (0.931).
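The idea of a contrastive regularized classifier with feature-based positive sampling can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the InfoNCE-style form of the loss, the `temperature` and `lam` parameters, and the choice to restrict the positive to the nearest same-class neighbor are all assumptions made for illustration. It also assumes every class has at least two samples in the batch.

```python
import numpy as np

def _unit_rows(z):
    # L2-normalize each embedding so dot products are cosine similarities.
    z = np.asarray(z, dtype=float)
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def nearest_same_class_positive(z, y):
    # Hypothetical feature-based positive sampling: for each anchor, pick the
    # most similar *other* embedding that shares its label.
    y = np.asarray(y)
    u = _unit_rows(z)
    sim = u @ u.T
    np.fill_diagonal(sim, -np.inf)          # an anchor cannot be its own positive
    same = y[:, None] == y[None, :]
    return np.where(same, sim, -np.inf).argmax(axis=1)

def contrastive_loss(z, y, pos_idx, temperature=0.1):
    # InfoNCE-style "push-and-pull": pull the anchor toward its positive and
    # push it away from all samples of the other class (assumed form).
    y = np.asarray(y)
    u = _unit_rows(z)
    sim = u @ u.T / temperature
    pos = sim[np.arange(len(y)), pos_idx]
    neg = np.where(y[:, None] != y[None, :], sim, -np.inf)
    logits = np.concatenate([pos[:, None], neg], axis=1)
    m = logits.max(axis=1, keepdims=True)   # stabilize the log-sum-exp
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(lse - pos))

def binary_cross_entropy(p, y, eps=1e-7):
    # Standard classification loss on predicted mortality probabilities p.
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def regularized_loss(p, z, y, lam=0.5, temperature=0.1):
    # Classification loss plus a weighted contrastive regularizer.
    pos_idx = nearest_same_class_positive(z, y)
    return binary_cross_entropy(p, y) + lam * contrastive_loss(z, y, pos_idx, temperature)

if __name__ == "__main__":
    z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])  # toy embeddings
    y = np.array([0, 0, 1, 1])                                      # toy labels
    p = np.array([0.1, 0.2, 0.8, 0.9])                              # toy predictions
    print(regularized_loss(p, z, y))
```

In this sketch the contrastive term is small when same-class embeddings cluster together and grows when an anchor's nearest same-class neighbor lies farther away than samples of the other class, which is the behavior the regularizer is meant to penalize.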
Related papers
- SeqRisk: Transformer-augmented latent variable model for improved survival prediction with longitudinal data [4.1476925904032464]
We propose SeqRisk, a method that combines a variational autoencoder (VAE) or longitudinal VAE (LVAE) with a transformer encoder and a Cox proportional hazards module for risk prediction.
We demonstrate that SeqRisk performs competitively compared to existing approaches on both simulated and real-world datasets.
arXiv Detail & Related papers (2024-09-19T12:35:25Z)
- MCRAGE: Synthetic Healthcare Data for Fairness [3.0089659534785853]
We propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE) to augment imbalanced datasets.
MCRAGE involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes.
We use this synthetic data to augment the existing imbalanced dataset, resulting in a more balanced distribution across all classes.
arXiv Detail & Related papers (2023-10-27T19:02:22Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- SANSformers: Self-Supervised Forecasting in Electronic Health Records
with Attention-Free Models [48.07469930813923]
This work aims to forecast the demand for healthcare services, by predicting the number of patient visits to healthcare facilities.
We introduce SANSformer, an attention-free sequential model designed with specific inductive biases to cater for the unique characteristics of EHR data.
Our results illuminate the promising potential of tailored attention-free models and self-supervised pretraining in refining healthcare utilization predictions across various patient demographics.
arXiv Detail & Related papers (2021-08-31T08:23:56Z)
- Categorical EHR Imputation with Generative Adversarial Nets [11.171712535005357]
We propose a simple and yet effective approach that is based on previous work on GANs for data imputation.
We show that our imputation approach largely improves the prediction accuracy, compared to more traditional data imputation approaches.
arXiv Detail & Related papers (2021-08-03T18:50:26Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
- Generation of Differentially Private Heterogeneous Electronic Health
Records [9.926231893220061]
We explore using Generative Adversarial Networks to generate synthetic, heterogeneous EHRs.
We will explore applying differential privacy (DP) preserving optimization in order to produce DP synthetic EHR data sets.
arXiv Detail & Related papers (2020-06-05T13:21:46Z)
- Predictive Modeling of ICU Healthcare-Associated Infections from
Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling
Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)