SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis
- URL: http://arxiv.org/abs/2511.11935v1
- Date: Fri, 14 Nov 2025 23:19:14 GMT
- Title: SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis
- Authors: Munib Mesinovic, Tingting Zhu
- Abstract summary: We present SurvBench, a comprehensive, open-source preprocessing pipeline that transforms raw PhysioNet datasets into model-ready tensors for multi-modal survival analysis. SurvBench provides data loaders for three major critical care databases, MIMIC-IV, eICU, and MC-MED. The pipeline implements rigorous data quality controls, patient-level splitting to prevent data leakage, explicit missingness tracking, and standardised temporal aggregation.
- Score: 2.74994442100348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Electronic health record (EHR) data present tremendous opportunities for advancing survival analysis through deep learning, yet reproducibility remains severely constrained by inconsistent preprocessing methodologies. We present SurvBench, a comprehensive, open-source preprocessing pipeline that transforms raw PhysioNet datasets into standardised, model-ready tensors for multi-modal survival analysis. SurvBench provides data loaders for three major critical care databases, MIMIC-IV, eICU, and MC-MED, supporting diverse modalities including time-series vitals, static demographics, ICD diagnosis codes, and radiology reports. The pipeline implements rigorous data quality controls, patient-level splitting to prevent data leakage, explicit missingness tracking, and standardised temporal aggregation. SurvBench handles both single-risk (e.g., in-hospital mortality) and competing-risks scenarios (e.g., multiple discharge outcomes). The outputs are compatible with pycox library packages and implementations of standard statistical and deep learning models. By providing reproducible, configuration-driven preprocessing with comprehensive documentation, SurvBench addresses the "preprocessing gap" that has hindered fair comparison of deep learning survival models, enabling researchers to focus on methodological innovation rather than data engineering.
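Two of the controls named in the abstract, patient-level splitting and explicit missingness tracking, can be illustrated with a minimal sketch. This is not SurvBench's actual API: the function names `patient_level_split` and `with_missingness_mask`, the 70/15/15 fractions, and the zero fill value are all assumptions chosen for illustration.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def patient_level_split(
    patient_ids: Sequence[str],
    fractions: Tuple[float, float, float] = (0.7, 0.15, 0.15),
    seed: int = 0,
) -> Dict[str, str]:
    """Map each unique patient to exactly one of train/val/test.

    Splitting on patients rather than on individual stays means that
    repeat admissions of one patient never straddle two splits.
    (Hypothetical helper, not part of SurvBench.)
    """
    unique = sorted(set(patient_ids))      # deduplicate, fix the order
    random.Random(seed).shuffle(unique)    # reproducible shuffle
    n_train = int(round(fractions[0] * len(unique)))
    n_val = int(round(fractions[1] * len(unique)))
    assignment: Dict[str, str] = {}
    for i, pid in enumerate(unique):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"
    return assignment


def with_missingness_mask(
    values: Sequence[Optional[float]], fill: float = 0.0
) -> Tuple[List[float], List[int]]:
    """Return (filled_values, mask); mask is 1 if observed, 0 if missing.

    Keeping the mask next to the filled values lets a downstream model
    tell an imputed entry apart from a measured one.
    """
    filled = [fill if v is None else v for v in values]
    mask = [0 if v is None else 1 for v in values]
    return filled, mask


# Example: two stays for patient "p1" always land in the same split.
stay_patients = ["p1", "p1", "p2", "p3", "p4"]
split_of = patient_level_split(stay_patients, seed=42)
stay_splits = [split_of[pid] for pid in stay_patients]

# Example: one vital sign sampled at three time steps, one value missing.
filled, mask = with_missingness_mask([72.0, None, 80.0])
```

The split is keyed on the patient identifier, so every stay inherits its patient's assignment through a dictionary lookup; the mask is returned as a separate channel rather than encoded in the fill value.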
Related papers
- Counterfactual Understanding via Retrieval-aware Multimodal Modeling for Time-to-Event Survival Prediction [1.5713805841057418]
CURE is a framework that advances counterfactual survival modeling via comprehensive multimodal embedding and latent retrieval. Experimental results on the METABRIC and TCGA-LUAD datasets demonstrate that the proposed CURE model consistently outperforms strong baselines in survival analysis.
arXiv Detail & Related papers (2026-02-23T15:53:25Z)
- Deep Survival Analysis for Competing Risk Modeling with Functional Covariates and Missing Data Imputation [13.108896747775063]
We introduce the Functional Competing Risk Net (FCRN), a unified deep-learning framework for discrete-time survival analysis under competing risks. By combining a micro-network Basis Layer for functional data representation with a gradient-based imputation module, FCRN simultaneously learns to impute missing values and predict event-specific hazards.
arXiv Detail & Related papers (2025-09-29T18:33:00Z)
- Revisiting Multivariate Time Series Forecasting with Missing Values [65.30332997607141]
Missing values are common in real-world time series. Current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. This framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle.
arXiv Detail & Related papers (2025-09-27T20:57:48Z)
- impuTMAE: Multi-modal Transformer with Masked Pre-training for Missing Modalities Imputation in Cancer Survival Prediction [75.43342771863837]
We introduce impuTMAE, a novel transformer-based end-to-end approach with an efficient multimodal pre-training strategy. It learns inter- and intra-modal interactions while simultaneously imputing missing modalities by reconstructing masked patches. Our model is pre-trained on heterogeneous, incomplete data and fine-tuned for glioma survival prediction using TCGA-GBM/LGG and BraTS datasets.
arXiv Detail & Related papers (2025-08-08T10:01:16Z)
- Deep Survival Analysis in Multimodal Medical Data: A Parametric and Probabilistic Approach with Competing Risks [47.19194118883552]
We introduce a multimodal deep learning framework for survival analysis capable of modeling both single and competing risks scenarios. We propose SAMVAE (Survival Analysis Multimodal Variational Autoencoder), a novel deep learning architecture designed for survival prediction.
arXiv Detail & Related papers (2025-07-10T14:29:48Z)
- Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank. It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z)
- SurvHive: a package to consistently access multiple survival-analysis packages [0.0]
SurvHive is a Python-based framework designed to unify survival analysis methods within a coherent interface modeled on scikit-learn. SurvHive integrates classical statistical models with cutting-edge deep learning approaches, including transformer-based architectures and parametric survival models.
arXiv Detail & Related papers (2025-02-04T11:02:40Z)
- CAAT-EHR: Cross-Attentional Autoregressive Transformer for Multimodal Electronic Health Record Embeddings [0.0]
We introduce CAAT-EHR, a novel architecture designed to generate task-agnostic longitudinal embeddings from raw EHR data. An autoregressive decoder complements the encoder by predicting data at future time points during pre-training, ensuring that the resulting embeddings maintain temporal consistency and alignment.
arXiv Detail & Related papers (2025-01-31T05:00:02Z)
- MIBP-Cert: Certified Training against Data Perturbations with Mixed-Integer Bilinear Programs [50.41998220099097]
Data errors, corruptions, and poisoning attacks during training pose a major threat to the reliability of modern AI systems. We introduce MIBP-Cert, a novel certification method based on mixed-integer bilinear programming (MIBP). By computing the set of parameters reachable through perturbed or manipulated data, we can predict all possible outcomes and guarantee robustness.
arXiv Detail & Related papers (2024-12-13T14:56:39Z)
- Multi-modal Data Binding for Survival Analysis Modeling with Incomplete Data and Annotations [19.560652381770243]
We introduce a novel framework that simultaneously handles incomplete data across modalities and censored survival labels. Our approach employs advanced foundation models to encode individual modalities and align them into a universal representation space. The proposed method demonstrates outstanding prediction accuracy in two survival analysis tasks on both employed datasets.
arXiv Detail & Related papers (2024-07-25T02:55:39Z)
- Clairvoyance: A Pipeline Toolkit for Medical Time Series [95.22483029602921]
Time-series learning is the bread and butter of data-driven *clinical decision support*. Clairvoyance proposes a unified, end-to-end, autoML-friendly pipeline that serves as a software toolkit. Clairvoyance is the first to demonstrate viability of a comprehensive and automatable pipeline for clinical time-series ML.
arXiv Detail & Related papers (2023-10-28T12:08:03Z)