Related papers: Tipping the Balance: Impact of Class Imbalance Correction on the Performance of Clinical Risk Prediction Models

URL: http://arxiv.org/abs/2603.00208v1
Date: Fri, 27 Feb 2026 12:29:16 GMT
Title: Tipping the Balance: Impact of Class Imbalance Correction on the Performance of Clinical Risk Prediction Models
Authors: Amalie Koch Andersen, Hadi Mehdizavareh, Arijit Khan, Tobias Becher, Simone Britsch, Markward Britsch, Morten Bøttcher, Simon Winther, Palle Duun Rohde, Morten Hasselstrøm Jensen, Simon Lebech Cichosz
Abstract summary: Class-imbalance correction techniques are commonly applied to improve model performance in settings with rare outcomes. This study evaluated the effect of widely used resampling strategies on both discrimination and calibration across real-world clinical prediction tasks.
Score: 2.2534253247996214
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Objective: ML-based clinical risk prediction models are increasingly used to support decision-making in healthcare. While class-imbalance correction techniques are commonly applied to improve model performance in settings with rare outcomes, their impact on probabilistic calibration remains insufficiently understood. This study evaluated the effect of widely used resampling strategies on both discrimination and calibration across real-world clinical prediction tasks. Methods: Ten clinical datasets spanning diverse medical domains and including 605,842 patients were analyzed. Multiple machine-learning model families, including linear models and several non-linear approaches, were evaluated. Models were trained on the original data and under three commonly used 1:1 class-imbalance correction strategies (SMOTE, RUS, ROS). Performance was assessed on held-out data using discrimination and calibration metrics. Results: Across all datasets and model families, resampling had no positive impact on predictive performance. Changes in the Receiver Operating Characteristic Area Under Curve (ROC-AUC) relative to models trained on the original data were small and inconsistent (ROS: -0.002, p<0.05; RUS: -0.004, p>0.05; SMOTE: -0.01, p<0.05), with no resampling strategy demonstrating a systematic improvement. In contrast, resampling in general degraded the calibration performance. Models trained using imbalance correction exhibited higher Brier scores (0.029 to 0.080, p<0.05), reflecting poorer probabilistic accuracy, and marked deviations in calibration intercept and slope, indicating systematic distortions of predicted risk despite preserved rank-based performance. Conclusion: In a diverse set of real-world clinical prediction tasks, commonly used class-imbalance correction techniques did not provide generalizable improvements in discrimination and were associated with degraded calibration.
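
As a minimal illustration of why 1:1 resampling can distort predicted risk while leaving rank-based performance untouched, the sketch below compares Brier scores before and after random undersampling (RUS) and random oversampling (ROS). It is not the paper's pipeline: the toy data, the 5% event rate, and the helper names (`brier_score`, `random_undersample`, `random_oversample`, `event_rate`) are hypothetical, and a constant predictor that outputs the training event rate stands in for a fitted model.

```python
import random


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def random_undersample(labels, rng):
    """RUS to 1:1: keep every positive plus an equal number of random negatives.

    Assumes positives (label 1) are the minority class.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return pos + rng.sample(neg, len(pos))


def random_oversample(labels, rng):
    """ROS to 1:1: keep all rows and duplicate random positives until balanced.

    Assumes positives (label 1) are the minority class.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return neg + pos + extra


def event_rate(labels, idx):
    """Fraction of positives among the selected row indices."""
    return sum(labels[i] for i in idx) / len(idx)


rng = random.Random(0)

# Hypothetical outcome vectors with a 5% event rate (not data from the paper).
y_train = [1] * 50 + [0] * 950
y_test = [1] * 25 + [0] * 475

# A constant "model" that predicts the event rate it saw during training.
rate_orig = event_rate(y_train, range(len(y_train)))
rate_rus = event_rate(y_train, random_undersample(y_train, rng))
rate_ros = event_rate(y_train, random_oversample(y_train, rng))

# Evaluate each constant prediction on the untouched held-out outcomes.
brier_orig = brier_score(y_test, [rate_orig] * len(y_test))
brier_rus = brier_score(y_test, [rate_rus] * len(y_test))
brier_ros = brier_score(y_test, [rate_ros] * len(y_test))

print(f"original: rate={rate_orig:.3f} brier={brier_orig:.4f}")
print(f"RUS 1:1 : rate={rate_rus:.3f} brier={brier_rus:.4f}")
print(f"ROS 1:1 : rate={rate_ros:.3f} brier={brier_ros:.4f}")
```

All three constant predictors rank patients identically, so their ROC-AUC is the same, yet the resampled ones predict a risk of 0.5 in a population where the event rate is 0.05. Their Brier score is 0.25 against 0.0475 for the predictor trained on the original data. Fitted models behave less extremely, but the shift in the implied baseline risk follows the same mechanism.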

Related papers

Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare [0.0]
We propose an evaluation framework that quantifies individual-level prediction instability by using two complementary diagnostics. We apply these diagnostics to simulated data and the GUSTO-I clinical dataset.
arXiv Detail & Related papers (2026-02-27T03:42:28Z)
Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency [52.50039435394964]
We systematically evaluate foundation models for regression-based tasks. We extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts.
arXiv Detail & Related papers (2026-01-29T14:06:50Z)
Classifier Calibration at Scale: An Empirical Study of Model-Agnostic Post-Hoc Methods [0.0]
We study model-agnostic post-hoc calibration methods to improve probabilistic predictions in supervised binary classification. We benchmark 21 widely used classifiers, including linear models, SVMs, and tree ensembles (CatBoost, XGBoost, LightGBM). We find that commonly used calibration procedures, most notably Platt scaling and isotonic regression, can systematically degrade proper scoring performance.
arXiv Detail & Related papers (2026-01-19T18:23:36Z)
An Explainable and Fair AI Tool for PCOS Risk Assessment: Calibration, Subgroup Equity, and Interactive Clinical Deployment [0.10026496861838446]
This paper presents a fairness-audited and interpretable machine learning framework for predicting polycystic ovary syndrome (PCOS). The framework integrated SHAP-based feature attributions with demographic audits to connect predictive explanations with observed disparities for actionable insights. A Streamlit-based web interface enables real-time PCOS risk assessment, Rotterdam criteria evaluation, and interactive 'what-if' analysis.
arXiv Detail & Related papers (2025-11-08T16:14:56Z)
Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
Calibration of Neural Networks [77.34726150561087]
This paper presents a survey of confidence calibration problems in the context of neural networks. We analyze problem statement, calibration definitions, and different approaches to evaluation. Empirical experiments cover various datasets and models, comparing calibration methods according to different criteria.
arXiv Detail & Related papers (2023-03-19T20:27:51Z)
On the Importance of Calibration in Semi-supervised Learning [13.859032326378188]
State-of-the-art (SOTA) semi-supervised learning (SSL) methods have been highly successful in leveraging a mix of labeled and unlabeled data. We introduce a family of new SSL models that optimize for calibration and demonstrate their effectiveness across standard vision benchmarks.
arXiv Detail & Related papers (2022-10-10T15:41:44Z)
Identifying and mitigating bias in algorithms used to manage patients in a pandemic [4.756860520861679]
Logistic regression models were created to predict COVID-19 mortality, ventilator status and inpatient status using a real-world dataset. Models showed a 57% decrease in the number of biased trials. After calibration, the average sensitivity of the predictive models increased from 0.527 to 0.955.
arXiv Detail & Related papers (2021-10-30T21:10:56Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score [59.75318183140857]
Estimating causal effects from randomized experiments is central to clinical research. Most methods for historical borrowing achieve reductions in variance by sacrificing strict type-I error rate control.
arXiv Detail & Related papers (2020-12-17T21:10:10Z)
UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model. UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data. We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD). UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms the best state-of-the-art baseline by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)