Mitigating Calibration Bias Without Fixed Attribute Grouping for
Improved Fairness in Medical Imaging Analysis
- URL: http://arxiv.org/abs/2307.01738v2
- Date: Thu, 20 Jul 2023 17:53:41 GMT
- Title: Mitigating Calibration Bias Without Fixed Attribute Grouping for
Improved Fairness in Medical Imaging Analysis
- Authors: Changjian Shui, Justin Szeto, Raghav Mehta, Douglas L. Arnold, Tal
Arbel
- Abstract summary: We propose Cluster-Focal, a two-stage method that first identifies poorly calibrated samples, clusters them into groups, and then applies a group-wise focal loss to reduce calibration bias.
We evaluate our method on skin lesion classification with the public HAM10000 dataset, and on predicting future lesional activity for multiple sclerosis (MS) patients.
- Score: 2.8943928153775826
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Trustworthy deployment of deep learning medical imaging models into
real-world clinical practice requires that they be calibrated. However, models
that are well calibrated overall can still be poorly calibrated for a
sub-population, potentially resulting in a clinician unwittingly making poor
decisions for this group based on the recommendations of the model. Although
methods have been shown to successfully mitigate biases across subgroups in
terms of model accuracy, this work focuses on the open problem of mitigating
calibration biases in the context of medical image analysis. Our method does
not require subgroup attributes during training, permitting the flexibility to
mitigate biases for different choices of sensitive attributes without
re-training. To this end, we propose a novel two-stage method, Cluster-Focal, which first identifies poorly calibrated samples, clusters them into groups, and then applies a group-wise focal loss to reduce calibration bias. We evaluate our
method on skin lesion classification with the public HAM10000 dataset, and on
predicting future lesional activity for multiple sclerosis (MS) patients. In
addition to considering traditional sensitive attributes (e.g. age, sex) with
demographic subgroups, we also consider biases among groups with different
image-derived attributes, such as lesion load, which are required in medical
image analysis. Our results demonstrate that our method effectively controls
calibration error in the worst-performing subgroups while preserving prediction
performance, and outperforms recent baselines.
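Below is a minimal sketch of the two-stage recipe the abstract describes. The sample-identification rule (confidence-vs-correctness gap), clustering on softmax vectors, and the per-group focusing parameters are illustrative assumptions, not the paper's exact design.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def identify_and_cluster(probs, labels, n_clusters=3, gap_threshold=0.2):
    """Stage 1 sketch: flag poorly calibrated samples and cluster them.
    probs: (N, C) softmax outputs of a trained model; labels: (N,) ints.
    The gap rule and k-means on probability vectors are our assumptions."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    gap = np.abs(conf - correct)          # per-sample miscalibration proxy
    groups = np.full(len(labels), -1)     # -1 = treated as well calibrated
    poor = gap > gap_threshold
    if poor.sum() >= n_clusters:
        groups[poor] = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(probs[poor])
    return groups

def groupwise_focal_loss(logits, targets, groups, gamma_by_group, gamma_default=0.0):
    """Stage 2 sketch: focal loss whose focusing parameter gamma depends on
    the calibration group from stage 1, so poorly calibrated clusters are
    penalized harder. The gamma schedule is a placeholder."""
    logp_t = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    p_t = logp_t.exp()
    gammas = torch.tensor([gamma_by_group.get(int(g), gamma_default) for g in groups],
                          dtype=logits.dtype, device=logits.device)
    return (-(1.0 - p_t) ** gammas * logp_t).mean()
```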
Related papers
- Bias Amplification Enhances Minority Group Performance [10.380812738348899]
We propose BAM, a novel two-stage training algorithm.
In the first stage, the model is trained using a bias amplification scheme by introducing a learnable auxiliary variable for each training sample.
In the second stage, we upweight the samples that the bias-amplified model misclassifies, and then continue training the same model on the reweighted dataset.
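A rough sketch of this two-stage scheme follows; how the auxiliary variable enters the output and the upweighting factor are our assumptions, not BAM's exact formulation.

```python
import torch
import torch.nn as nn

class BiasAmplifier(nn.Module):
    """Stage 1 sketch: one learnable auxiliary vector per training sample,
    added to the model's logits so easy (bias-aligned) samples can be fit
    through the auxiliary term, amplifying reliance on spurious features."""
    def __init__(self, n_samples, n_classes, strength=0.5):
        super().__init__()
        self.aux = nn.Parameter(torch.zeros(n_samples, n_classes))
        self.strength = strength  # illustrative scaling, not BAM's exact choice

    def forward(self, logits, sample_idx):
        return logits + self.strength * self.aux[sample_idx]

@torch.no_grad()
def stage2_sample_weights(model, loader, upweight=5.0):
    """Stage 2 sketch: upweight samples the bias-amplified model
    misclassifies, then continue training the same model on them."""
    model.eval()
    weights = []
    for x, y in loader:
        wrong = model(x).argmax(dim=1) != y
        w = torch.ones(len(y))
        w[wrong] = upweight
        weights.append(w)
    return torch.cat(weights)
```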
arXiv Detail & Related papers (2023-09-13T04:40:08Z)
- Is this model reliable for everyone? Testing for strong calibration [4.893345190925178]
In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup.
The task of auditing a model for strong calibration is well-known to be difficult due to the sheer number of potential subgroups.
Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal.
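As a concrete illustration of what strong calibration demands, the sketch below audits a fixed list of candidate subgroups; the paper's contribution, testing across the combinatorially many subgroups under weak signal, is not reproduced here.

```python
import numpy as np

def subgroup_calibration_gaps(pred_prob, outcomes, subgroup_masks):
    """For each candidate subgroup, compare the mean predicted probability
    with the observed event rate. Strong calibration requires a small gap
    in every subgroup, not just on average."""
    gaps = {}
    for name, mask in subgroup_masks.items():
        if mask.sum() == 0:
            continue
        gaps[name] = float(pred_prob[mask].mean() - outcomes[mask].mean())
    return gaps

# Hypothetical usage with a binary risk model:
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 1000)
y = rng.binomial(1, p)                 # perfectly calibrated by construction
age_over_60 = rng.random(1000) < 0.3   # an example subgroup mask
print(subgroup_calibration_gaps(p, y, {"age>60": age_over_60}))
```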
arXiv Detail & Related papers (2023-07-28T00:59:14Z)
- Calibration of Neural Networks [77.34726150561087]
This paper presents a survey of confidence calibration problems in the context of neural networks.
We analyze problem statement, calibration definitions, and different approaches to evaluation.
Empirical experiments cover various datasets and models, comparing calibration methods according to different criteria.
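One of the standard evaluation criteria such surveys compare is expected calibration error; a common equal-width-binned implementation follows (the bin count is our choice).

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Standard ECE: bin samples by confidence, then average the absolute
    gap between accuracy and mean confidence per bin, weighted by bin size."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece
```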
arXiv Detail & Related papers (2023-03-19T20:27:51Z)
- Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective [51.70661197256033]
We propose ARCO, a semi-supervised contrastive learning framework with stratified group theory for medical image segmentation.
We first propose building ARCO through the concept of variance-reduced estimation and show that certain variance-reduction techniques are particularly beneficial in pixel/voxel-level segmentation tasks.
We experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D medical and three semantic segmentation datasets, with different label settings.
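The sketch below is a generic numpy illustration of the variance-reduction principle the summary refers to, showing that stratified (group-aware) sampling yields a lower-variance estimate of a mean than simple random sampling; it is not the ARCO framework itself.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy "pixel labels": two strata with very different foreground rates,
# mimicking class-density differences across image regions.
stratum_a = rng.binomial(1, 0.05, 10_000)   # background-heavy region
stratum_b = rng.binomial(1, 0.60, 10_000)   # lesion-heavy region
population = np.concatenate([stratum_a, stratum_b])

def simple_estimate(n):
    return rng.choice(population, n).mean()

def stratified_estimate(n):
    # Sample half from each stratum, then average the per-stratum means.
    half = n // 2
    return 0.5 * (rng.choice(stratum_a, half).mean() +
                  rng.choice(stratum_b, half).mean())

simple = [simple_estimate(200) for _ in range(2000)]
strat = [stratified_estimate(200) for _ in range(2000)]
print(f"simple var:     {np.var(simple):.6f}")
print(f"stratified var: {np.var(strat):.6f}")   # noticeably smaller
```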
arXiv Detail & Related papers (2023-02-03T13:50:25Z)
- On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness all affect calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
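One plausible reading of selective scaling is sketched below: apply a temperature only to predictions flagged as likely incorrect. The confidence-based selection rule and temperature value are our stand-ins for the paper's exact procedure.

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def selective_scaling(logits, t_wrong=2.0, conf_threshold=0.7):
    """Temperature-scale only the predictions deemed likely incorrect (here,
    low-confidence rows of an (N, C) logit array), leaving confident
    predictions untouched. The selection rule is an illustrative stand-in
    for the paper's correctness-based splitting."""
    probs = softmax(logits)
    likely_wrong = probs.max(axis=-1) < conf_threshold
    probs[likely_wrong] = softmax(logits[likely_wrong], t=t_wrong)
    return probs
```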
arXiv Detail & Related papers (2022-12-22T22:05:16Z)
- Fair admission risk prediction with proportional multicalibration [0.16249424686052708]
Multicalibration constrains calibration error among flexibly-defined subpopulations.
This makes it possible for a decision-maker to learn to trust or distrust model predictions for specific groups.
We propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins.
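A sketch of the quantity being constrained, percent calibration error per group within each prediction bin, follows; the binning, minimum cell size, and epsilon guard are our assumptions.

```python
import numpy as np

def proportional_calibration_errors(pred, y, groups, n_bins=10, eps=1e-8):
    """For each (group, prediction bin) cell, compute |mean prediction -
    event rate| divided by the event rate, i.e. calibration error as a
    fraction of the group's risk in that bin."""
    edges = np.linspace(0, 1, n_bins + 1)
    out = {}
    for g in np.unique(groups):
        for b in range(n_bins):
            m = (groups == g) & (pred > edges[b]) & (pred <= edges[b + 1])
            if m.sum() < 10:          # skip cells too small to estimate
                continue
            rate = y[m].mean()
            out[(g, b)] = abs(pred[m].mean() - rate) / (rate + eps)
    return out
```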
arXiv Detail & Related papers (2022-09-29T08:15:29Z)
- Density-Aware Personalized Training for Risk Prediction in Imbalanced Medical Data [89.79617468457393]
Training models on data with imbalanced class densities may lead to suboptimal predictions.
We propose a training framework that addresses this imbalance issue.
We demonstrate our model's improved performance in real-world medical datasets.
arXiv Detail & Related papers (2022-07-23T00:39:53Z)
- Prototypical Calibration for Few-shot Learning of Language Models [84.5759596754605]
GPT-like models have been recognized as fragile across different hand-crafted templates and demonstration permutations.
We propose prototypical calibration to adaptively learn a more robust decision boundary for zero- and few-shot classification.
Our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance.
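One plausible sketch of the prototype idea: cluster the model's output probability vectors with a Gaussian mixture and predict by component membership. The component-to-label mapping here is an illustrative heuristic, not the paper's estimation procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def prototypical_decision(probs, n_classes, seed=0):
    """Fit one mixture component per class to the model's (N, C) output
    probability vectors, then predict by component membership instead of
    raw argmax. Mapping each component to the class where its mean is
    largest is an illustrative heuristic."""
    gmm = GaussianMixture(n_components=n_classes, random_state=seed).fit(probs)
    comp_to_label = gmm.means_.argmax(axis=1)     # component -> class guess
    return comp_to_label[gmm.predict(probs)]
```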
arXiv Detail & Related papers (2022-05-20T13:50:07Z)
- On the Calibration of Pre-trained Language Models using Mixup Guided by Area Under the Margin and Saliency [47.90235939359225]
We propose a novel mixup strategy for pre-trained language models that improves model calibration further.
Our method achieves the lowest expected calibration error compared to strong baselines on both in-domain and out-of-domain test samples.
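For reference, vanilla mixup is sketched below; the AUM- and saliency-guided choice of which pairs to mix, which is the paper's actual contribution, is omitted.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Vanilla mixup: convex-combine random pairs of inputs and their
    one-hot labels with a Beta-distributed mixing weight. The paper guides
    pair selection with Area Under the Margin and saliency; that guidance
    is omitted in this generic sketch."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]
```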
arXiv Detail & Related papers (2022-03-14T23:45:08Z)
- Does deep learning model calibration improve performance in class-imbalanced medical image classification? [0.8594140167290096]
We perform a systematic analysis of the effect of model calibration on classification performance for two medical imaging modalities.
Our results indicate that at the default operating threshold of 0.5, the performance achieved through calibration is significantly superior to using uncalibrated probabilities.
arXiv Detail & Related papers (2021-09-29T12:00:32Z)
- Improved Trainable Calibration Method for Neural Networks on Medical Imaging Classification [17.941506832422192]
Empirically, neural networks are often miscalibrated and overconfident in their predictions.
We propose a novel calibration approach that maintains the overall classification accuracy while significantly improving model calibration.
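The summary does not spell out the loss, so the sketch below shows a generic trainable-calibration term in this spirit: cross-entropy plus a batch-level penalty on the gap between mean confidence and accuracy. The weight beta and the penalty's exact form are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def calibration_aware_loss(logits, targets, beta=1.0):
    """Cross-entropy plus a differentiable penalty on the absolute gap
    between the batch's mean confidence and its accuracy, discouraging
    overconfidence during training."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    conf = probs.max(dim=1).values.mean()
    acc = (probs.argmax(dim=1) == targets).float().mean()
    return ce + beta * torch.abs(conf - acc)
```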
arXiv Detail & Related papers (2020-09-09T01:25:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.