Does deep learning model calibration improve performance in
class-imbalanced medical image classification?
- URL: http://arxiv.org/abs/2110.00918v2
- Date: Wed, 6 Oct 2021 13:39:47 GMT
- Title: Does deep learning model calibration improve performance in
class-imbalanced medical image classification?
- Authors: Sivaramakrishnan Rajaraman, Prasanth Ganesan, Sameer Antani
- Abstract summary: We perform a systematic analysis of the effect of model calibration on its performance on two medical image modalities.
Our results indicate that at the default operating threshold of 0.5, the performance achieved through calibration is significantly superior to using uncalibrated probabilities.
- Score: 0.8594140167290096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In medical image classification tasks, it is common to find that the number
of normal samples far exceeds the number of abnormal samples. In such
class-imbalanced situations, reliable training of deep neural networks
continues to be a major challenge. Under these circumstances, the predicted
class probabilities may be biased toward the majority class. Calibration has
been suggested to alleviate some of these effects. However, there is
insufficient analysis explaining when and whether calibrating a model would be
beneficial in improving performance. In this study, we perform a systematic
analysis of the effect of model calibration on its performance on two medical
image modalities, namely, chest X-rays and fundus images, using various deep
learning classifier backbones. For this, we study the following variations: (i)
the degree of imbalance in the dataset used for training; (ii) calibration
methods; and (iii) two classification thresholds, namely, the default decision
threshold of 0.5 and the optimal threshold derived from precision-recall curves. Our
results indicate that at the default operating threshold of 0.5, the
performance achieved through calibration is significantly superior (p < 0.05)
to using uncalibrated probabilities. However, at the PR-guided threshold, these
gains are not significantly different (p > 0.05). This finding holds for both
image modalities and at varying degrees of imbalance.
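As a rough illustration of the comparison described in the abstract, the sketch below (not the authors' pipeline; the calibration method, library calls, and synthetic data are assumptions) applies temperature scaling, one common calibration method, to binary validation scores and contrasts the default 0.5 threshold with the F1-maximizing threshold read off the precision-recall curve.

```python
# A minimal, illustrative sketch, NOT the authors' pipeline: it assumes binary-classifier
# validation logits/labels are available as NumPy arrays, applies temperature scaling
# (one common calibration method; the abstract does not name the methods studied), and
# contrasts the default 0.5 threshold with the F1-maximizing threshold taken from the
# precision-recall curve. All names and the synthetic data below are assumptions.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.metrics import precision_recall_curve, f1_score


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_temperature(val_logits, val_labels):
    """Find the temperature T minimizing the negative log-likelihood of sigmoid(logit / T)."""
    def nll(t):
        p = np.clip(sigmoid(val_logits / t), 1e-7, 1 - 1e-7)
        return -np.mean(val_labels * np.log(p) + (1 - val_labels) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x


def pr_optimal_threshold(probs, labels):
    """Pick the threshold that maximizes F1 along the precision-recall curve."""
    precision, recall, thresholds = precision_recall_curve(labels, probs)
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    return thresholds[np.argmax(f1)]


# Toy usage on synthetic, class-imbalanced data (~10% positives). In practice the
# temperature and the threshold would be chosen on a held-out validation split.
rng = np.random.default_rng(0)
labels = (rng.random(2000) < 0.10).astype(int)
logits = 2.5 * (labels - 0.1) + rng.normal(scale=1.5, size=2000)  # imperfectly calibrated scores

T = fit_temperature(logits, labels)
for name, probs in [("uncalibrated", sigmoid(logits)), ("calibrated", sigmoid(logits / T))]:
    thr = pr_optimal_threshold(probs, labels)
    print(f"{name:>12}: F1@0.5 = {f1_score(labels, probs >= 0.5):.3f}, "
          f"F1@PR-optimal({thr:.2f}) = {f1_score(labels, probs >= thr):.3f}")
```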
Related papers
- Automatic hip osteoarthritis grading with uncertainty estimation from
computed tomography using digitally-reconstructed radiographs [5.910133714106733]
The severity of hip osteoarthritis (hip OA) is often classified using the Crowe and Kellgren-Lawrence classifications.
Deep learning models were trained to predict the disease grade using two grading schemes.
The models produced a comparable accuracy of approximately 0.65 (ECA) and 0.95 (ONCA) in the classification and regression settings.
arXiv Detail & Related papers (2023-12-30T07:28:56Z)
- On the calibration of neural networks for histological slide-level
classification [47.99822253865054]
We compare three neural network architectures that combine patch-level feature representations into a slide-level prediction, with respect to their classification performance.
We observe that Transformers lead to good results in terms of classification performance and calibration.
arXiv Detail & Related papers (2023-12-15T11:46:29Z)
- Mitigating Calibration Bias Without Fixed Attribute Grouping for
Improved Fairness in Medical Imaging Analysis [2.8943928153775826]
The proposed Cluster-Focal method first identifies poorly calibrated samples, clusters them into groups, and then applies a group-wise focal loss to reduce calibration bias.
We evaluate our method on skin lesion classification with the public HAM10000 dataset, and on predicting future lesional activity for multiple sclerosis (MS) patients.
arXiv Detail & Related papers (2023-07-04T14:14:12Z)
- Performance of GAN-based augmentation for deep learning COVID-19 image
classification [57.1795052451257]
The biggest challenge in applying deep learning to the medical domain is the limited availability of training data.
Data augmentation is a typical methodology used in machine learning when confronted with a limited data set.
In this work, a StyleGAN2-ADA generative adversarial network is trained on the limited COVID-19 chest X-ray image set.
arXiv Detail & Related papers (2023-04-18T15:39:58Z)
- Multi-Head Multi-Loss Model Calibration [13.841172927454204]
We introduce a form of simplified ensembling that bypasses the costly training and inference of deep ensembles.
Specifically, each head is trained to minimize a weighted Cross-Entropy loss, but the weights are different among the different branches.
We show that the resulting averaged predictions can achieve excellent calibration without sacrificing accuracy on two challenging datasets (a rough sketch of this multi-head idea appears after this list).
arXiv Detail & Related papers (2023-03-02T09:32:32Z)
- On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness all have an impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z)
- DOMINO: Domain-aware Model Calibration in Medical Image Segmentation [51.346121016559024]
Modern deep neural networks are poorly calibrated, compromising trustworthiness and reliability.
We propose DOMINO, a domain-aware model calibration method that leverages the semantic confusability and hierarchical similarity between class labels.
Our results show that DOMINO-calibrated deep neural networks outperform non-calibrated models and state-of-the-art morphometric methods in head image segmentation.
arXiv Detail & Related papers (2022-09-13T15:31:52Z)
- T-Cal: An optimal test for the calibration of predictive models [49.11538724574202]
We consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem.
Detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions.
We propose T-Cal, a minimax test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE); a sketch of the plain binned ECE appears after this list.
arXiv Detail & Related papers (2022-03-03T16:58:54Z)
- Multi-loss ensemble deep learning for chest X-ray classification [0.8594140167290096]
Class imbalance is common in medical image classification tasks, where the number of abnormal samples is smaller than the number of normal samples.
We propose novel loss functions to train a DL model and analyze its performance in a multiclass classification setting.
arXiv Detail & Related papers (2021-09-29T14:14:04Z)
- Improved Trainable Calibration Method for Neural Networks on Medical
Imaging Classification [17.941506832422192]
Empirically, neural networks are often miscalibrated and overconfident in their predictions.
We propose a novel calibration approach that maintains the overall classification accuracy while significantly improving model calibration.
arXiv Detail & Related papers (2020-09-09T01:25:53Z)
- Calibration of Neural Networks using Splines [51.42640515410253]
Measuring calibration error amounts to comparing two empirical distributions.
We introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test.
Our method consistently outperforms existing methods on KS error as well as other commonly used calibration measures; a sketch of a KS-style calibration error appears after this list.
arXiv Detail & Related papers (2020-06-23T07:18:05Z)
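Relating to the "Multi-Head Multi-Loss Model Calibration" entry above (and referenced there), the following is a rough PyTorch sketch of the general idea as summarized in that entry: several heads share one backbone, each head minimizes a Cross-Entropy loss with its own class weights, and the per-head softmax outputs are averaged at inference. The architecture, weight choices, and toy data are assumptions, not the paper's configuration.

```python
# Rough PyTorch sketch (assumed architecture, not the paper's): K output heads share one
# backbone, each head is trained with a Cross-Entropy loss using its own class weights,
# and the per-head softmax outputs are averaged to form the final prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadClassifier(nn.Module):
    def __init__(self, in_dim, num_classes, head_class_weights):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(64, num_classes) for _ in head_class_weights])
        # one class-weight vector per head, used only by that head's loss term
        self.head_class_weights = [torch.tensor(w) for w in head_class_weights]

    def forward(self, x):
        feats = self.backbone(x)
        return [head(feats) for head in self.heads]  # one logit tensor per head

    def loss(self, logits_per_head, targets):
        # sum of differently weighted Cross-Entropy losses, one per head
        return sum(
            F.cross_entropy(logits, targets, weight=w)
            for logits, w in zip(logits_per_head, self.head_class_weights)
        )

    @torch.no_grad()
    def predict_proba(self, x):
        # averaged softmax across heads serves as the (hopefully better calibrated) output
        return torch.stack([F.softmax(l, dim=-1) for l in self(x)]).mean(dim=0)


# Toy usage: three heads with different class weightings for a 2-class problem.
model = MultiHeadClassifier(in_dim=16, num_classes=2,
                            head_class_weights=[[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 16), torch.randint(0, 2, (32,))
model.loss(model(x), y).backward()
optimizer.step()
print(model.predict_proba(x)[:3])
```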
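For intuition about the quantity the T-Cal entry above tests (and as referenced there), the next sketch computes the standard equal-width binned plug-in estimate of the Expected Calibration Error (ECE) for a binary classifier. T-Cal's debiased estimator and minimax test are not reproduced here; the toy data are assumptions.

```python
# Standard binned plug-in ECE for a binary classifier: within each confidence bin,
# compare the mean predicted probability with the observed positive rate, and average
# the absolute gaps weighted by bin occupancy. Illustrative only; not T-Cal's
# debiased estimator.
import numpy as np


def binned_ece(probs, labels, n_bins=15):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            confidence = probs[mask].mean()   # mean predicted probability in the bin
            accuracy = labels[mask].mean()    # observed frequency of the positive class
            ece += mask.mean() * abs(confidence - accuracy)
    return ece


# Toy check: y ~ Bernoulli(p) is calibrated by construction; sharpening p makes it overconfident.
rng = np.random.default_rng(0)
p = rng.random(5000)
y = (rng.random(5000) < p).astype(float)
sharpened = p**2 / (p**2 + (1 - p) ** 2)
print("calibrated    ECE:", round(binned_ece(p, y), 3))
print("overconfident ECE:", round(binned_ece(sharpened, y), 3))
```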
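In the spirit of the "Calibration of Neural Networks using Splines" entry above (and as referenced there), the last sketch shows a binning-free, KS-style calibration error: sort samples by predicted probability and take the largest gap between the cumulative predicted probability mass and the cumulative count of observed positives. This illustrates the general idea only; the paper's spline-based machinery is not reproduced, and the toy data are assumptions.

```python
# Binning-free KS-style calibration error: the maximum absolute difference between the
# running (cumulative) average of predicted probabilities and of observed labels, with
# samples sorted by predicted probability. Illustrative sketch only.
import numpy as np


def ks_calibration_error(probs, labels):
    order = np.argsort(probs)
    cum_pred = np.cumsum(probs[order]) / len(probs)   # cumulative predicted probability mass
    cum_true = np.cumsum(labels[order]) / len(probs)  # cumulative observed positives
    return np.max(np.abs(cum_pred - cum_true))


# Same toy construction as the ECE sketch: Bernoulli(p) labels are calibrated by design.
rng = np.random.default_rng(0)
p = rng.random(5000)
y = (rng.random(5000) < p).astype(float)
sharpened = p**2 / (p**2 + (1 - p) ** 2)
print("calibrated    KS error:", round(ks_calibration_error(p, y), 3))
print("overconfident KS error:", round(ks_calibration_error(sharpened, y), 3))
```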