Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs
- URL: http://arxiv.org/abs/2506.14540v3
- Date: Mon, 30 Jun 2025 11:59:31 GMT
- Title: Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs
- Authors: Gerardo A. Flores, Alyssa H. Smith, Julia A. Fukuyama, Ashia C. Wilson
- Abstract summary: We propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers. We derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
- Score: 3.299877799532224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning-based decision support systems are increasingly deployed in clinical settings, where probabilistic scoring functions are used to inform and prioritize patient management decisions. However, widely used scoring rules, such as accuracy and AUC-ROC, fail to adequately reflect key clinical priorities, including calibration, robustness to distributional shifts, and sensitivity to asymmetric error costs. In this work, we propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers that explicitly accounts for the uncertainty in class prevalences and domain-specific cost asymmetries often found in clinical settings. Building on the theory of proper scoring rules, particularly the Schervish representation, we derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
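To make the abstract's construction concrete, here is a minimal numpy sketch, assuming the Schervish representation of the log score as a threshold integral of cost-weighted misclassification loss. The truncation range `[c_lo, c_hi]`, the function names, and the synthetic data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cost_weighted_loss(p, y, c):
    """Cost-weighted misclassification at decision threshold c:
    a false positive costs c, a false negative costs (1 - c)."""
    fp = (p >= c) & (y == 0)
    fn = (p < c) & (y == 1)
    return c * fp + (1 - c) * fn

def adjusted_log_score(p, y, c_lo=0.1, c_hi=0.5, n_grid=201):
    """Truncated Schervish integral: the full log score integrates
    cost_weighted_loss / (c * (1 - c)) over all c in (0, 1);
    restricting to [c_lo, c_hi] scores the model only over the
    clinically relevant cost/prevalence trade-offs."""
    cs = np.linspace(c_lo, c_hi, n_grid)
    vals = np.array([np.mean(cost_weighted_loss(p, y, c)) / (c * (1 - c))
                     for c in cs])
    return vals.mean() * (c_hi - c_lo)  # Riemann average; lower is better

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.2, size=2000)                      # ~20% prevalence
p = np.clip(0.15 + 0.7 * y + rng.normal(0, 0.2, 2000), 0.01, 0.99)
print(adjusted_log_score(p, y))
```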
Related papers
- CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation [6.930435788495898]
We propose the CRG Score, a metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.
arXiv Detail & Related papers (2025-05-22T17:02:28Z)
- A Consequentialist Critique of Binary Classification Evaluation Practices [4.603739046972463]
We show a preference for top-K metrics or fixed thresholds in evaluations at major conferences like ICML, FAccT, and CHIL. We use this decision-theoretic framework to map evaluation metrics to their optimal use cases, along with a Python package, briertools, to promote the broader adoption of Brier scores.
arXiv Detail & Related papers (2025-04-06T15:58:01Z)
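As a plain-numpy companion to the entry above (not the briertools API), this sketch uses the standard identity that the Brier score equals twice the uniform average, over thresholds c, of the decision cost with false-positive cost c and false-negative cost 1 - c; the data are synthetic.

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted probability and outcome."""
    return np.mean((p - y) ** 2)

def decision_cost(p, y, c):
    """Expected cost of thresholding at c: false positives cost c,
    false negatives cost (1 - c)."""
    fp = np.mean((p >= c) & (y == 0))
    fn = np.mean((p < c) & (y == 1))
    return c * fp + (1 - c) * fn

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=5000)
p = np.clip(0.3 + 0.5 * (y - 0.3) + rng.normal(0, 0.15, 5000), 0.01, 0.99)
cs = np.linspace(0.0, 1.0, 201)
# Brier score ~= 2 x threshold-averaged decision cost: one proper score
# summarizes decision quality across the whole cost spectrum.
print(brier_score(p, y), 2 * np.mean([decision_cost(p, y, c) for c in cs]))
```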
- From Uncertainty to Precision: Enhancing Binary Classifier Performance through Calibration [0.3495246564946556]
Given that model-predicted scores are commonly seen as event probabilities, calibration is crucial for accurate interpretation.
We analyze the sensitivity of various calibration measures to score distortions and introduce a refined metric, the Local Score.
We apply these findings in a real-world scenario, using a Random Forest classifier and regressor to predict credit default while simultaneously measuring calibration.
arXiv Detail & Related papers (2024-02-12T16:55:19Z)
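The paper's refined Local Score is not reproduced here; as a stand-in, this sketch runs the kind of calibration audit the abstract describes, computing a standard binned expected calibration error (ECE) for a random forest on synthetic imbalanced data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def expected_calibration_error(p, y, n_bins=10):
    """Binned ECE: |mean predicted prob - observed rate| per bin,
    weighted by bin occupancy."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

X, y = make_classification(n_samples=4000, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]
print("ECE:", expected_calibration_error(p, y_te))
```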
- Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
arXiv Detail & Related papers (2023-11-08T00:10:21Z)
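A fixed-sample illustration of the likelihood-ratio inversion behind the entry above, assuming a Bernoulli likelihood: keep every parameter whose likelihood ratio against the MLE is not too small. The paper builds anytime-valid confidence sequences (e.g., comparing a running likelihood-ratio martingale to 1/alpha); this sketch uses the classical Wilks chi-square cutoff instead.

```python
import numpy as np

def lr_confidence_set(xs, grid=None):
    """Confidence set for a Bernoulli mean via likelihood-ratio
    inversion, using the Wilks chi-square(1) approximation."""
    if grid is None:
        grid = np.linspace(0.001, 0.999, 999)
    n, k = len(xs), xs.sum()
    loglik = k * np.log(grid) + (n - k) * np.log(1 - grid)
    mle = np.clip(k / n, 1e-3, 1 - 1e-3)
    loglik_mle = k * np.log(mle) + (n - k) * np.log(1 - mle)
    # 3.841 = 95% quantile of chi-square(1); keep theta whose LR
    # statistic 2 * (loglik_mle - loglik) stays below it.
    return grid[2 * (loglik_mle - loglik) <= 3.841]

rng = np.random.default_rng(2)
xs = rng.binomial(1, 0.35, size=200)
cs = lr_confidence_set(xs)
print(f"[{cs.min():.3f}, {cs.max():.3f}]")
```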
- Calibration by Distribution Matching: Trainable Kernel Calibration Metrics [56.629245030893685]
We introduce kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression.
These metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization.
We provide intuitive mechanisms to tailor calibration metrics to a decision task, and enforce accurate loss estimation and no-regret decisions.
arXiv Detail & Related papers (2023-10-31T06:19:40Z)
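One concrete instance in the spirit of the entry above (the paper's metrics are more general, covering regression and decision calibration): an unbiased U-statistic estimate of a squared kernel calibration error for binary predictions, with a differentiable sample estimate. The Gaussian kernel and bandwidth are assumptions.

```python
import numpy as np

def skce(p, y, bandwidth=0.1):
    """Squared kernel calibration error, estimated as the U-statistic
    mean of (y_i - p_i) k(p_i, p_j) (y_j - p_j) over i != j.
    Zero in expectation when E[y | p] = p; characteristic kernels
    make the converse hold as well."""
    n = len(p)
    diff = y - p
    # Gaussian kernel on the predicted probabilities
    K = np.exp(-((p[:, None] - p[None, :]) ** 2) / (2 * bandwidth ** 2))
    off_diag = K * np.outer(diff, diff)
    np.fill_diagonal(off_diag, 0.0)
    return off_diag.sum() / (n * (n - 1))

rng = np.random.default_rng(3)
p = rng.uniform(0.05, 0.95, size=1500)
y_cal = rng.binomial(1, p)            # perfectly calibrated outcomes
y_miscal = rng.binomial(1, p ** 2)    # systematically miscalibrated
print(skce(p, y_cal), skce(p, y_miscal))
```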
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty [52.03490691733464]
We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.
By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation.
DEviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data.
arXiv Detail & Related papers (2023-01-01T05:02:46Z)
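DEviS's exact modules are not reproduced here; the sketch below shows the standard subjective-logic construction its abstract invokes, mapping per-class evidence to a Dirichlet distribution and reading off a vacuity-style uncertainty that a filtering module could threshold.

```python
import numpy as np

def evidential_uncertainty(logits):
    """Subjective-logic uncertainty from per-class evidence.
    Nonnegative evidence e_k (here a softplus of raw outputs) defines
    a Dirichlet with alpha_k = e_k + 1; the vacuity u = K / sum(alpha)
    is large when total evidence is small."""
    evidence = np.log1p(np.exp(logits))          # softplus, elementwise
    alpha = evidence + 1.0
    strength = alpha.sum(axis=-1, keepdims=True)
    prob = alpha / strength                      # expected class probabilities
    vacuity = logits.shape[-1] / strength.squeeze(-1)
    return prob, vacuity

logits = np.array([[4.0, -1.0],    # confident prediction: low vacuity
                   [0.1, 0.0]])    # little evidence: high vacuity
prob, u = evidential_uncertainty(logits)
print(prob, u)
```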
- Better Uncertainty Calibration via Proper Scores for Classification and Beyond [15.981380319863527]
We introduce the framework of proper calibration errors, which relates every calibration error to a proper score.
This relationship can be used to reliably quantify the model calibration improvement.
arXiv Detail & Related papers (2022-03-15T12:46:08Z)
- Improving the compromise between accuracy, interpretability and personalization of rule-based machine learning in medical problems [0.08594140167290096]
We introduce a new component to predict if a given rule will be correct or not for a particular patient, which introduces personalization into the procedure.
The validation results using three public clinical datasets show that it also increases the predictive performance of the selected set of rules.
arXiv Detail & Related papers (2021-06-15T01:19:04Z)
- Unsupervised Calibration under Covariate Shift [92.02278658443166]
We introduce the problem of calibration under domain shift and propose an importance sampling based approach to address it.
We evaluate and discuss the efficacy of our method on both real-world datasets and synthetic datasets.
arXiv Detail & Related papers (2020-06-29T21:50:07Z)
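A minimal sketch of the importance-sampling idea in the entry above: reweight a binned calibration estimate by a density ratio w(x) = p_target(x) / p_source(x) so it reflects the shifted target distribution. The paper estimates this ratio from data; here the weights are handed in, and all names and data are illustrative.

```python
import numpy as np

def importance_weighted_ece(p, y, w, n_bins=10):
    """Binned calibration error with each sample weighted by an
    importance ratio, so bins are judged under the target shift."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, bins) - 1, 0, n_bins - 1)
    total = w.sum()
    ece = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            wb = w[m]
            ece += (wb.sum() / total) * abs(
                np.average(p[m], weights=wb) - np.average(y[m], weights=wb))
    return ece

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 4000)
p = 1 / (1 + np.exp(-1.5 * x))               # overconfident model scores
y = rng.binomial(1, 1 / (1 + np.exp(-x)))    # true outcome probabilities
w = np.exp(0.5 * x)   # density ratio for a mean-shifted target (up to a constant)
print(importance_weighted_ece(p, y, w))
```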
- Calibration of Neural Networks using Splines [51.42640515410253]
Measuring calibration error amounts to comparing two empirical distributions.
We introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test.
Our method consistently outperforms existing methods on KS error as well as other commonly used calibration measures.
arXiv Detail & Related papers (2020-06-23T07:18:05Z)
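The KS-style measure described above can be sketched in a few lines: sort by predicted score, then take the largest gap between the cumulative label mass and the cumulative predicted mass. This is one plausible reading of the abstract, not necessarily the paper's exact estimator.

```python
import numpy as np

def ks_calibration_error(p, y):
    """Binning-free KS calibration error: max gap between the
    cumulative observed labels and cumulative predicted probabilities,
    with samples ordered by predicted score."""
    order = np.argsort(p)
    cum_y = np.cumsum(y[order]) / len(y)
    cum_p = np.cumsum(p[order]) / len(p)
    return np.max(np.abs(cum_y - cum_p))

rng = np.random.default_rng(5)
p = rng.uniform(size=5000)
print(ks_calibration_error(p, rng.binomial(1, p)))       # ~0: calibrated
print(ks_calibration_error(p, rng.binomial(1, p ** 2)))  # larger: miscalibrated
```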