Related papers: Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification

Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification

URL: http://arxiv.org/abs/2506.23298v3
Date: Thu, 17 Jul 2025 18:00:33 GMT
Title: Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification
Authors: Xing Shen, Justin Szeto, Mingyang Li, Hengguan Huang, Tal Arbel,
Abstract summary: Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis.<n>We present the first investigation into the calibration biases and demographic unfairness of MLLMs' predictions and confidence scores in few-shot in-context learning for medical image classification.<n>We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases.
Score: 8.43909252072479
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracies of their predictions, and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs' predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to inference. It then applies this estimation to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets: PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification demonstrate CALIN's effectiveness at ensuring fair confidence calibration in its prediction, while improving its overall prediction accuracies and exhibiting minimum fairness-utility trade-off. Our codebase can be found at https://github.com/xingbpshen/medical-calibration-fairness-mllm.

Related papers

Rethinking Early Stopping: Refine, Then Calibrate [49.966899634962374]
We present a novel variational formulation of the calibration-refinement decomposition.<n>We provide theoretical and empirical evidence that calibration and refinement errors are not minimized simultaneously during training.
arXiv Detail & Related papers (2025-01-31T15:03:54Z)
Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification [15.98427699337596]
We perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification.<n>We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques.
arXiv Detail & Related papers (2025-01-31T12:23:50Z)
Mitigating Calibration Bias Without Fixed Attribute Grouping for Improved Fairness in Medical Imaging Analysis [2.8943928153775826]
Cluster-Focal to first identify poorly calibrated samples, cluster them into groups, and then introduce group-wise focal loss to improve calibration bias. We evaluate our method on skin lesion classification with the public HAM10000 dataset, and on predicting future lesional activity for multiple sclerosis (MS) patients.
arXiv Detail & Related papers (2023-07-04T14:14:12Z)
Calibration of Neural Networks [77.34726150561087]
This paper presents a survey of confidence calibration problems in the context of neural networks. We analyze problem statement, calibration definitions, and different approaches to evaluation. Empirical experiments cover various datasets and models, comparing calibration methods according to different criteria.
arXiv Detail & Related papers (2023-03-19T20:27:51Z)
On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration. Model capacity, crop size, multi-scale testing, and prediction correctness have impact on calibration. We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z)
Fair admission risk prediction with proportional multicalibration [0.16249424686052708]
Multicalibration constrains calibration error among flexibly-defined subpopulations. It is possible for a decision-maker to learn to trust or distrust model predictions for specific groups. We propose proportional multicalibration, a criteria that constrains the percent calibration error among groups and within prediction bins.
arXiv Detail & Related papers (2022-09-29T08:15:29Z)
DOMINO: Domain-aware Model Calibration in Medical Image Segmentation [51.346121016559024]
Modern deep neural networks are poorly calibrated, compromising trustworthiness and reliability. We propose DOMINO, a domain-aware model calibration method that leverages the semantic confusability and hierarchical similarity between class labels. Our results show that DOMINO-calibrated deep neural networks outperform non-calibrated models and state-of-the-art morphometric methods in head image segmentation.
arXiv Detail & Related papers (2022-09-13T15:31:52Z)
On the Calibration of Pre-trained Language Models using Mixup Guided by Area Under the Margin and Saliency [47.90235939359225]
We propose a novel mixup strategy for pre-trained language models that improves model calibration further. Our method achieves the lowest expected calibration error compared to strong baselines on both in-domain and out-of-domain test samples.
arXiv Detail & Related papers (2022-03-14T23:45:08Z)
On the relationship between calibrated predictors and unbiased volume estimation [18.96093589337619]
Machine learning driven medical image segmentation has become standard in medical image analysis. However, deep learning models are prone to overconfident predictions. This has led to a renewed focus on calibrated predictions in the medical imaging and broader machine learning communities.
arXiv Detail & Related papers (2021-12-23T14:22:19Z)
Does deep learning model calibration improve performance in class-imbalanced medical image classification? [0.8594140167290096]
We perform a systematic analysis of the effect of model calibration on its performance on two medical image modalities. Our results indicate that at the default operating threshold of 0.5, the performance achieved through calibration is significantly superior to using uncalibrated probabilities.
arXiv Detail & Related papers (2021-09-29T12:00:32Z)
Calibration of Neural Networks using Splines [51.42640515410253]
Measuring calibration error amounts to comparing two empirical distributions. We introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test. Our method consistently outperforms existing methods on KS error as well as other commonly used calibration measures.
arXiv Detail & Related papers (2020-06-23T07:18:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.