Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation
- URL: http://arxiv.org/abs/2506.03942v2
- Date: Fri, 11 Jul 2025 09:35:23 GMT
- Title: Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation
- Authors: Theodore Barfoot, Luis C. Garcia-Peraza-Herrera, Samet Akcay, Ben Glocker, Tom Vercauteren
- Abstract summary: Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. We propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We find that the soft-binned variant yields the greatest improvements in calibration over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance.
- Score: 14.869379716339212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration, over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: https://github.com/cai4cai/Average-Calibration-Losses
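To make the hard-binned variant concrete, here is a minimal NumPy sketch of a per-image mL1-ACE computation. The function name, the (C, N) class-by-pixel layout, and the bin count are our own illustrative assumptions, not the authors' reference implementation (which is available at the repository above):

```python
import numpy as np

def ml1_ace(probs, labels, n_bins=10):
    """Hard-binned marginal L1 Average Calibration Error for one image.

    probs:  (C, N) per-class predicted probabilities for N pixels.
    labels: (C, N) one-hot ground truth.
    """
    errors = []
    for c in range(probs.shape[0]):
        p, y = probs[c], labels[c]
        # Assign each pixel to a confidence bin.
        bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
        per_bin = []
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                # |empirical accuracy - mean confidence| in this bin.
                per_bin.append(abs(y[mask].mean() - p[mask].mean()))
        errors.append(np.mean(per_bin))  # ACE for class c
    return float(np.mean(errors))        # marginal: average over classes
```

In the paper, this quantity is added as an auxiliary term alongside a Dice plus cross-entropy segmentation loss; the soft-binned variant replaces the hard bin assignment with smooth bin membership weights.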
Related papers
- We Care Each Pixel: Calibrating on Medical Segmentation Model [15.826029150910566]
Pixel-wise Expected Calibration Error (pECE) is a novel metric that measures miscalibration at the pixel level. We also introduce a morphological adaptation strategy that applies morphological operations to ground-truth masks before computing calibration losses. Our method not only enhances segmentation performance but also improves calibration quality, yielding more trustworthy confidence estimates.
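One plausible reading of the morphological adaptation strategy is to dilate the ground-truth mask before computing calibration losses, softening penalties near object boundaries. The sketch below is a pure-NumPy stand-in for scipy/OpenCV morphology; the paper's exact operations and parameters may differ:

```python
import numpy as np

def dilate(mask, iterations=1):
    """4-neighbourhood binary dilation (pure-NumPy stand-in for
    scipy.ndimage/OpenCV morphology; border wrap-around from np.roll
    is ignored for this sketch)."""
    m = mask.astype(bool)
    for _ in range(iterations):
        m = (m | np.roll(m, 1, axis=0) | np.roll(m, -1, axis=0)
               | np.roll(m, 1, axis=1) | np.roll(m, -1, axis=1))
    return m
```

A calibration loss would then be evaluated against `dilate(gt_mask)` rather than `gt_mask` itself, so that confident predictions just outside the annotated boundary are penalised less harshly.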
arXiv Detail & Related papers (2025-03-07T03:06:03Z) - Average Calibration Error: A Differentiable Loss for Improved Reliability in Image Segmentation [17.263160921956445]
We propose to use marginal L1 average calibration error (mL1-ACE) as a novel auxiliary loss function to improve pixel-wise calibration without compromising segmentation quality.
We show that this loss, despite using hard binning, is directly differentiable, bypassing the need for approximate but differentiable surrogate or soft binning approaches.
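The direct differentiability claim can be checked numerically: within a bin, the per-bin accuracy is constant with respect to the probabilities while the mean confidence is linear in them, so the hard-binned ACE has a well-defined gradient almost everywhere (away from bin boundaries). A single-class sketch, with names and layout of our own choosing, comparing the analytic gradient against central finite differences:

```python
import numpy as np

def hard_ace(p, y, n_bins=10):
    """Single-class hard-binned ACE for one image (sketch)."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    nonempty = [b for b in range(n_bins) if (bins == b).any()]
    return float(np.mean([abs(y[bins == b].mean() - p[bins == b].mean())
                          for b in nonempty]))

def hard_ace_grad(p, y, n_bins=10):
    """Analytic gradient: the accuracy term is constant in p, so only
    the mean-confidence term contributes; valid away from boundaries."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    nonempty = [b for b in range(n_bins) if (bins == b).any()]
    g = np.zeros_like(p)
    for b in nonempty:
        m = bins == b
        g[m] = np.sign(p[m].mean() - y[m].mean()) / (m.sum() * len(nonempty))
    return g
```

Because the gradient exists almost everywhere, the loss can be backpropagated directly without soft-binning surrogates, which is the point the paper makes.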
arXiv Detail & Related papers (2024-03-11T14:31:03Z) - Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty [52.03490691733464]
We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.
By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation.
DEviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data.
arXiv Detail & Related papers (2023-01-01T05:02:46Z) - On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness all have an impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z) - A Closer Look at the Calibration of Differentially Private Learners [33.715727551832785]
We study the calibration of classifiers trained with differentially private stochastic gradient descent (DP-SGD).
Our analysis identifies per-example gradient clipping in DP-SGD as a major cause of miscalibration.
We show that differentially private variants of post-processing calibration methods such as temperature scaling and Platt scaling are surprisingly effective.
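For context, plain (non-private) temperature scaling fits a single scalar T on a validation set to minimise negative log-likelihood; the paper's contribution is a privatised variant of this post-processing step. A minimal grid-search sketch, with function names of our own choosing:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=None):
    """Standard temperature scaling by grid search on validation NLL.
    A DP version would need to privatise this step, e.g. by noising
    the statistics it touches."""
    if grid is None:
        grid = np.linspace(0.25, 8.0, 311)
    def nll(T):
        p = softmax(logits / T)[np.arange(len(labels)), labels]
        return -np.log(p + 1e-12).mean()
    return float(grid[np.argmin([nll(T) for T in grid])])
```

For an overconfident model (large logit margins but only moderate accuracy), the fitted T comes out well above 1, flattening the predicted distribution toward the empirical accuracy.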
arXiv Detail & Related papers (2022-10-15T10:16:18Z) - DOMINO: Domain-aware Model Calibration in Medical Image Segmentation [51.346121016559024]
Modern deep neural networks are poorly calibrated, compromising trustworthiness and reliability.
We propose DOMINO, a domain-aware model calibration method that leverages the semantic confusability and hierarchical similarity between class labels.
Our results show that DOMINO-calibrated deep neural networks outperform non-calibrated models and state-of-the-art morphometric methods in head image segmentation.
arXiv Detail & Related papers (2022-09-13T15:31:52Z) - Calibrating the Dice loss to handle neural network overconfidence for biomedical image segmentation [2.6465053740712157]
The Dice similarity coefficient (DSC) is a widely used metric and loss function for biomedical image segmentation.
In this study, we identify poor calibration as an emerging challenge of deep learning based biomedical image segmentation.
We provide a simple yet effective extension of the DSC loss, named the DSC++ loss, that selectively modulates the penalty associated with overconfident, incorrect predictions.
arXiv Detail & Related papers (2021-10-31T16:02:02Z) - Privacy Preserving Recalibration under Domain Shift [119.21243107946555]
We introduce a framework that abstracts out the properties of recalibration problems under differential privacy constraints.
We also design a novel recalibration algorithm, accuracy temperature scaling, that outperforms prior work on private datasets.
arXiv Detail & Related papers (2020-08-21T18:43:37Z) - Calibration of Neural Networks using Splines [51.42640515410253]
Measuring calibration error amounts to comparing two empirical distributions.
We introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test.
Our method consistently outperforms existing methods on KS error as well as other commonly used calibration measures.
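The KS-inspired measure needs no bins: sort predictions by confidence and take the maximum gap between the cumulative predicted confidence and the cumulative empirical accuracy. A short sketch of that idea (our own simplified formulation, which may differ in detail from the paper's):

```python
import numpy as np

def ks_calibration_error(conf, correct):
    """Binning-free KS-style calibration error: max gap between the
    cumulative mean confidence and cumulative accuracy, with samples
    sorted by confidence."""
    order = np.argsort(conf)
    n = len(conf)
    c = np.cumsum(conf[order]) / n     # cumulative predicted confidence
    a = np.cumsum(correct[order]) / n  # cumulative empirical accuracy
    return float(np.abs(c - a).max())
```

Unlike binned ACE/ECE, this estimate has no bin-count hyperparameter, which is what makes it attractive as a comparison metric.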
arXiv Detail & Related papers (2020-06-23T07:18:05Z) - Calibrating Deep Neural Networks using Focal Loss [77.92765139898906]
Miscalibration is a mismatch between a model's confidence and its correctness.
We show that focal loss allows us to learn models that are already very well calibrated.
We show that our approach achieves state-of-the-art calibration without compromising on accuracy in almost all cases.
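The calibrating effect of focal loss comes from its modulating factor, which shrinks the loss on examples the model already predicts confidently and correctly, discouraging ever-larger logits. A minimal binary sketch:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss (Lin et al.): the (1 - p_t)^gamma factor
    down-weights easy, already-confident examples; gamma = 0 recovers
    plain cross-entropy."""
    pt = np.where(y == 1, p, 1 - p)          # probability of true class
    return float(np.mean(-((1 - pt) ** gamma) * np.log(pt + 1e-12)))
```

With gamma = 2, a correctly classified pixel at p_t = 0.9 contributes only (0.1)^2 = 1% of its cross-entropy loss, so the optimiser gains little by pushing confidence further toward 1, which is the mechanism behind the calibration improvement.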
arXiv Detail & Related papers (2020-02-21T17:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.