Related papers: Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms

Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms

URL: http://arxiv.org/abs/2509.10369v1
Date: Fri, 12 Sep 2025 16:01:18 GMT
Title: Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms
Authors: Gul Rukh Khattak, Konstantinos Patlatzoglou, Joseph Barker, Libor Pastika, Boroumand Zeidaabadi, Ahmed El-Medany, Hesham Aggour, Yixiu Liang, Antonio H. Ribeiro, Jeffrey Annis, Antonio Luiz Pinho Ribeiro, Junbo Ge, Daniel B. Kramer, Jonathan W. Waks, Evan Brittain, Nicholas Peters, Fu Siong Ng, Arunashis Sau,
Abstract summary: We present Contrasting by Patient Augmented Electrocardiograms (CAPE) foundation model and pretrain on four cohorts.<n>We assess how cohort demographics, health status, and population diversity influence the downstream performance for prediction tasks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Contrastive learning is a widely adopted self-supervised pretraining strategy, yet its dependence on cohort composition remains underexplored. We present Contrasting by Patient Augmented Electrocardiograms (CAPE) foundation model and pretrain on four cohorts (n = 5,203,352), from diverse populations across three continents (North America, South America, Asia). We systematically assess how cohort demographics, health status, and population diversity influence the downstream performance for prediction tasks also including two additional cohorts from another continent (Europe). We find that downstream performance depends on the distributional properties of the pretraining cohort, including demographics and health status. Moreover, while pretraining with a multi-centre, demographically diverse cohort improves in-distribution accuracy, it reduces out-of-distribution (OOD) generalisation of our contrastive approach by encoding cohort-specific artifacts. To address this, we propose the In-Distribution Batch (IDB) strategy, which preserves intra-cohort consistency during pretraining and enhances OOD robustness. This work provides important insights for developing clinically fair and generalisable foundation models.

Related papers

Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency [52.50039435394964]
We systematically evaluate foundation models for regression-based tasks.<n>We extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models.<n>Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts.
arXiv Detail & Related papers (2026-01-29T14:06:50Z)
Integrating Genomics into Multimodal EHR Foundation Models [56.31910745104141]
This paper introduces an innovative EHR foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality.<n>The framework aims to learn complex relationships between clinical data and genetic predispositions.<n>This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies.
arXiv Detail & Related papers (2025-10-24T15:56:40Z)
General Demographic Foundation Models for Enhancing Predictive Performance Across Diseases [0.39508022083907385]
This study proposes a General Demographic Pre-trained (GDP) model as a foundational representation framework tailored to age and gender.<n>The model is pre-trained and evaluated using datasets with diverse diseases and population compositions from different geographic regions.
arXiv Detail & Related papers (2025-09-09T02:02:27Z)
Evaluating the Fairness of the MIMIC-IV Dataset and a Baseline Algorithm: Application to the ICU Length of Stay Prediction [65.268245109828]
This paper uses the MIMIC-IV dataset to examine the fairness and bias in an XGBoost binary classification model predicting the ICU length of stay. The research reveals class imbalances in the dataset across demographic attributes and employs data preprocessing and feature extraction. The paper concludes with recommendations for fairness-aware machine learning techniques for mitigating biases and the need for collaborative efforts among healthcare professionals and data scientists.
arXiv Detail & Related papers (2023-12-31T16:01:48Z)
Density-Aware Personalized Training for Risk Prediction in Imbalanced Medical Data [89.79617468457393]
Training models with imbalance rate (class density discrepancy) may lead to suboptimal prediction. We propose a framework for training models for this imbalance issue. We demonstrate our model's improved performance in real-world medical datasets.
arXiv Detail & Related papers (2022-07-23T00:39:53Z)
An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation [91.62129090006745]
This paper studies the distribution shift problem from the perspective of pre-training and data augmentation. We provide the first comprehensive empirical study focusing on pre-training and data augmentation.
arXiv Detail & Related papers (2022-05-25T13:04:53Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
STELAR: Spatio-temporal Tensor Factorization with Latent Epidemiological Regularization [76.57716281104938]
We develop a tensor method to predict the evolution of epidemic trends for many regions simultaneously. STELAR enables long-term prediction by incorporating latent temporal regularization through a system of discrete-time difference equations. We conduct experiments using both county- and state-level COVID-19 data and show that our model can identify interesting latent patterns of the epidemic.
arXiv Detail & Related papers (2020-12-08T21:21:47Z)
WRSE -- a non-parametric weighted-resolution ensemble for predicting individual survival distributions in the ICU [0.251657752676152]
Dynamic assessment of mortality risk in the intensive care unit (ICU) can be used to stratify patients, inform about treatment effectiveness or serve as part of an early-warning system. We show competitive results with state-of-the-art probabilistic models, while greatly reducing training time by factors of 2-9x.
arXiv Detail & Related papers (2020-11-02T10:13:59Z)
Predictive Modeling of ICU Healthcare-Associated Infections from Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units. The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.