Representation Learning of Lab Values via Masked AutoEncoder
- URL: http://arxiv.org/abs/2501.02648v2
- Date: Thu, 09 Jan 2025 11:17:01 GMT
- Title: Representation Learning of Lab Values via Masked AutoEncoder
- Authors: David Restrepo, Chenwei Wu, Yueran Jia, Jaden K. Sun, Jack Gallifant, Catherine G. Bielick, Yugang Jia, Leo A. Celi,
- Abstract summary: We propose Lab-MAE, a transformer-based masked autoencoder framework for imputation of sequential lab values.
Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE significantly outperforms the state-of-the-art baselines.
Lab-MAE achieves equitable performance across demographic groups of patients, advancing fairness in clinical predictions.
- Score: 2.785172582119726
- License:
- Abstract: Accurate imputation of missing laboratory values in electronic health records (EHRs) is critical to enable robust clinical predictions and reduce biases in AI systems in healthcare. Existing methods, such as variational autoencoders (VAEs) and decision tree-based approaches such as XGBoost, struggle to model the complex temporal and contextual dependencies in EHR data, mainly in underrepresented groups. In this work, we propose Lab-MAE, a novel transformer-based masked autoencoder framework that leverages self-supervised learning for the imputation of continuous sequential lab values. Lab-MAE introduces a structured encoding scheme that jointly models laboratory test values and their corresponding timestamps, enabling explicit capturing temporal dependencies. Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE significantly outperforms the state-of-the-art baselines such as XGBoost across multiple metrics, including root mean square error (RMSE), R-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves equitable performance across demographic groups of patients, advancing fairness in clinical predictions. We further investigate the role of follow-up laboratory values as potential shortcut features, revealing Lab-MAE's robustness in scenarios where such data is unavailable. The findings suggest that our transformer-based architecture, adapted to the characteristics of the EHR data, offers a foundation model for more accurate and fair clinical imputation models. In addition, we measure and compare the carbon footprint of Lab-MAE with the baseline XGBoost model, highlighting its environmental requirements.
Related papers
- LV-CadeNet: Long View Feature Convolution-Attention Fusion Encoder-Decoder Network for Clinical MEG Spike Detection [5.140340328388902]
We introduce LV-CadeNet, designed for automatic MEG spike detection in real-world clinical scenarios.
Our approach also mimics human specialists by constructing long view morphological input data.
LV-CadeNet significantly improves the accuracy of MEG spike detection, boosting it from 42.31% to 54.88% on a novel clinical dataset.
arXiv Detail & Related papers (2024-12-12T03:19:44Z) - Development and Comparative Analysis of Machine Learning Models for Hypoxemia Severity Triage in CBRNE Emergency Scenarios Using Physiological and Demographic Data from Medical-Grade Devices [0.0]
Gradient Boosting Models (GBMs) outperformed sequential models in terms of training speed, interpretability, and reliability.
A 5-minute prediction window was chosen for timely intervention, with minute-levels standardizing the data.
This study highlights ML's potential to improve triage and reduce alarm fatigue.
arXiv Detail & Related papers (2024-10-30T23:24:28Z) - Evaluating the Fairness of the MIMIC-IV Dataset and a Baseline
Algorithm: Application to the ICU Length of Stay Prediction [65.268245109828]
This paper uses the MIMIC-IV dataset to examine the fairness and bias in an XGBoost binary classification model predicting the ICU length of stay.
The research reveals class imbalances in the dataset across demographic attributes and employs data preprocessing and feature extraction.
The paper concludes with recommendations for fairness-aware machine learning techniques for mitigating biases and the need for collaborative efforts among healthcare professionals and data scientists.
arXiv Detail & Related papers (2023-12-31T16:01:48Z) - Mixed-Integer Projections for Automated Data Correction of EMRs Improve
Predictions of Sepsis among Hospitalized Patients [7.639610349097473]
We introduce an innovative projections-based method that seamlessly integrates clinical expertise as domain constraints.
We measure the distance of corrected data from the constraints defining a healthy range of patient data, resulting in a unique predictive metric we term as "trust-scores"
We show an AUROC of 0.865 and a precision of 0.922, that surpasses conventional ML models without such projections.
arXiv Detail & Related papers (2023-08-21T15:14:49Z) - TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic
Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z) - Clinical Deterioration Prediction in Brazilian Hospitals Based on
Artificial Neural Networks and Tree Decision Models [56.93322937189087]
An extremely boosted neural network (XBNet) is used to predict clinical deterioration (CD)
The XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.
arXiv Detail & Related papers (2022-12-17T23:29:14Z) - Assessing Robustness of EEG Representations under Data-shifts via Latent
Space and Uncertainty Analysis [0.29998889086656577]
We develop diagnostic measures to detect potential pitfalls during deployment without assuming access to external data.
Specifically, we focus on modeling realistic data shifts in electrophysiological signals (EEGs) via data transforms.
We conduct experiments on multiple EEG feature encoders and two clinically relevant downstream tasks using publicly available large-scale clinical EEGs.
arXiv Detail & Related papers (2022-09-22T19:26:09Z) - Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can quickly simulate million-size populations in few seconds on commodity hardware, integrate with deep neural networks and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Adversarial Sample Enhanced Domain Adaptation: A Case Study on
Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation.
adversarially generated samples are used during domain adaptation.
Results confirm the effectiveness of our method and the generality on different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.