Calibration-Aware Prompt Learning for Medical Vision-Language Models
- URL: http://arxiv.org/abs/2509.15226v1
- Date: Thu, 18 Sep 2025 17:59:58 GMT
- Title: Calibration-Aware Prompt Learning for Medical Vision-Language Models
- Authors: Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan
- Abstract summary: Miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability. We introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt consistently improves calibration without drastically affecting clean accuracy.
- Score: 44.97741487992985
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable performance across diverse medical imaging tasks by leveraging large-scale image-text pretraining. However, their confidence calibration remains largely unexplored and poses a significant challenge: miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt optimizes a small set of learnable prompts with carefully designed calibration objectives under a scarce labeled-data regime. First, we study a regularizer that attempts to align the smoothed accuracy with the predicted model confidences. Second, we introduce an angular separation loss that maximizes textual feature proximity to improve the reliability of confidence estimates in multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets reveal that CalibPrompt consistently improves calibration without drastically affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt.
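The two objectives lend themselves to a compact sketch. Below is a minimal PyTorch rendering of the ideas as described in the abstract; the smoothing scheme, loss weights, and function names are our assumptions, not the authors' released implementation (see the repository linked above for their code).

```python
import torch
import torch.nn.functional as F

def confidence_alignment_loss(logits, labels, smooth=0.5):
    """Regularizer pulling per-sample confidence toward a smoothed
    accuracy signal. Mixing each sample's 0/1 correctness with the
    batch accuracy is an assumed smoothing scheme."""
    probs = logits.softmax(dim=-1)
    conf, preds = probs.max(dim=-1)
    correct = (preds == labels).float()
    smoothed_acc = smooth * correct + (1.0 - smooth) * correct.mean()
    return F.l1_loss(conf, smoothed_acc)

def angular_proximity_loss(text_feats):
    """Angular term over the learned class-prompt text features: the
    abstract speaks of maximizing textual feature proximity, so this
    sketch minimizes the mean pairwise angle between class embeddings."""
    z = F.normalize(text_feats, dim=-1)            # (C, D) unit vectors
    cos = (z @ z.t()).clamp(-1 + 1e-6, 1 - 1e-6)   # pairwise cosines
    mask = ~torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    return torch.acos(cos[mask]).mean()            # mean pairwise angle

def total_loss(logits, labels, text_feats, lam1=1.0, lam2=0.1):
    # Cross-entropy plus the two calibration objectives; weights assumed.
    return (F.cross_entropy(logits, labels)
            + lam1 * confidence_alignment_loss(logits, labels)
            + lam2 * angular_proximity_loss(text_feats))
```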
Related papers
- On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers. We introduce capability calibration, which targets the model's expected accuracy on a query. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
arXiv Detail & Related papers (2026-02-14T01:07:45Z)
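The pass@$k$ connection in the entry above has a simple closed form: if confidence is calibrated to the per-sample success probability and samples are independent, pass@$k$ can be predicted directly. A minimal sketch (the independence assumption and function name are ours, not the paper's):

```python
def predicted_pass_at_k(p: float, k: int) -> float:
    """If p is a calibrated per-sample success probability and samples
    are independent, at least one of k samples succeeds with
    probability 1 - (1 - p)**k."""
    return 1.0 - (1.0 - p) ** k

# Example: a query the model answers correctly 30% of the time per sample.
for k in (1, 5, 10):
    print(k, round(predicted_pass_at_k(0.3, k), 3))
```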
- Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models [1.4008409814572673]
We introduce Prompt4Trust, the first reinforcement learning framework for prompt augmentation targeting confidence calibration in MLLMs. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes the aspects of calibration most critical for safe and trustworthy clinical decision-making. Our framework showed promising zero-shot generalization to larger MLLMs in our experiments.
arXiv Detail & Related papers (2025-07-12T13:21:10Z)
- Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision-Language Models (MedVLMs) by adjusting the model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
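The entry above is about deferring uncertain predictions to a human expert rather than retraining. A toy sketch of that control flow, using normalized predictive entropy as the uncertainty signal (the threshold and routing policy are our assumptions, not Expert-CFG's mechanism):

```python
import math
import torch

def route_by_uncertainty(logits, threshold=0.5):
    """Return predictions plus a mask of samples whose normalized
    predictive entropy exceeds the threshold and should be deferred
    to a clinical expert."""
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    normalized = entropy / math.log(probs.size(-1))  # scale to [0, 1]
    needs_expert = normalized > threshold
    return probs.argmax(dim=-1), needs_expert
```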
- Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification [8.43909252072479]
Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. We present the first investigation into the calibration biases and demographic unfairness of MLLMs' predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases.
arXiv Detail & Related papers (2025-06-29T15:37:17Z)
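CALIN's specific procedure is not reproduced here; for orientation, the standard inference-time calibration baseline such methods are typically compared against is temperature scaling, fit on held-out logits. A generic sketch (this is temperature scaling, not CALIN):

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.05):
    """Fit a single temperature T on held-out data by minimizing NLL;
    dividing logits by T rescales confidence without changing argmax."""
    log_t = torch.zeros(1, requires_grad=True)  # T = exp(log_t) > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()
```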
- Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation [14.869379716339212]
Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. We propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We find that the soft-binned variant yields the greatest improvements in calibration over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance.
arXiv Detail & Related papers (2025-06-04T13:32:07Z)
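A minimal sketch of a soft-binned, differentiable L1 average calibration error in the spirit of the entry above; the bin count, soft-assignment kernel, and per-image reduction are assumptions rather than the paper's exact formulation:

```python
import torch

def soft_binned_l1_ace(probs, labels, num_bins=15, temperature=0.01):
    """Differentiable L1 average calibration error with soft bin
    assignments, computable per image (e.g., flatten H*W pixels).
    probs: (N, C) predicted probabilities; labels: (N,) targets."""
    conf, preds = probs.max(dim=-1)
    correct = (preds == labels).float()
    centers = torch.linspace(0.0, 1.0, num_bins, device=probs.device)
    # Soft-assign each confidence to bins by distance to bin centers.
    weights = torch.softmax(
        -(conf[:, None] - centers[None, :]) ** 2 / temperature, dim=1)
    mass = weights.sum(dim=0) + 1e-8
    bin_conf = (weights * conf[:, None]).sum(dim=0) / mass
    bin_acc = (weights * correct[:, None]).sum(dim=0) / mass
    # ACE averages |accuracy - confidence| uniformly over bins.
    return (bin_acc - bin_conf).abs().mean()
```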
- Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation [26.580361841501514]
Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration. This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information. We propose a novel Confidence through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for object-centric queries.
arXiv Detail & Related papers (2025-04-21T04:01:22Z)
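The idea of probing confidence with semantic perturbations can be caricatured as an agreement score across paraphrased queries. A toy sketch only loosely inspired by the entry above; the `ask_model` callable and the majority-agreement rule are hypothetical placeholders, not CSP's method:

```python
from collections import Counter
from typing import Callable, List, Tuple

def perturbation_confidence(ask_model: Callable[[str], str],
                            query: str,
                            paraphrases: List[str]) -> Tuple[str, float]:
    """Query the model with the original question plus semantic
    perturbations; use the agreement rate on the majority answer as
    a confidence proxy."""
    answers = [ask_model(q) for q in [query, *paraphrases]]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)
```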
- Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration [20.049443396032423]
Black-box large language models (LLMs) are increasingly deployed in various environments.
LLMs often exhibit overconfidence, leading to potential risks and misjudgments.
We propose a novel method, Atypical Presentations Recalibration, which leverages atypical presentations to adjust the model's confidence estimates.
arXiv Detail & Related papers (2024-09-05T03:45:35Z)
- Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection [58.789823426981044]
We propose a novel auxiliary loss formulation that aims to align the class confidence of bounding boxes with the accuracy of predictions.
Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in-domain and out-of-domain scenarios.
arXiv Detail & Related papers (2023-03-25T08:56:21Z)
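In detection, the natural correctness signal for a predicted box is its localization quality. A minimal sketch of a train-time auxiliary term that penalizes the gap between classification confidence and an IoU-based correctness proxy (the proxy and the plain L1 penalty are our assumptions, not the paper's exact loss):

```python
import torch

def confidence_iou_alignment(cls_conf, ious, iou_thresh=0.5):
    """Auxiliary term encouraging each box's class confidence to match
    whether it actually localizes its object (IoU above threshold).
    cls_conf: (N,) max class scores; ious: (N,) IoU with matched GT."""
    correct = (ious > iou_thresh).float()
    return (cls_conf - correct).abs().mean()
```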
- Towards Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty [57.023423137202485]
Concerns regarding the reliability of medical image segmentation persist among clinicians. We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks. By leveraging subjective logic theory, we explicitly model probability and uncertainty for medical image segmentation.
arXiv Detail & Related papers (2023-01-01T05:02:46Z)
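Subjective logic, as used in the entry above, maps per-class evidence to belief masses and an explicit uncertainty mass. A minimal sketch of that standard mapping for a K-class head (the softplus evidence function is a common choice and an assumption here, not necessarily DEviS's):

```python
import torch
import torch.nn.functional as F

def evidential_outputs(logits):
    """Map network outputs to Dirichlet parameters, per-class beliefs,
    and an uncertainty mass u = K / S (subjective logic)."""
    evidence = F.softplus(logits)              # non-negative evidence e_k
    alpha = evidence + 1.0                     # Dirichlet parameters
    strength = alpha.sum(dim=1, keepdim=True)  # S = sum_k alpha_k
    belief = evidence / strength               # b_k = e_k / S
    uncertainty = logits.size(1) / strength    # u = K / S, in (0, 1]
    prob = alpha / strength                    # expected class probability
    return belief, uncertainty, prob
```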
- DOMINO: Domain-aware Model Calibration in Medical Image Segmentation [51.346121016559024]
Modern deep neural networks are poorly calibrated, compromising trustworthiness and reliability.
We propose DOMINO, a domain-aware model calibration method that leverages the semantic confusability and hierarchical similarity between class labels.
Our results show that DOMINO-calibrated deep neural networks outperform non-calibrated models and state-of-the-art morphometric methods in head image segmentation.
arXiv Detail & Related papers (2022-09-13T15:31:52Z)
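One way to exploit semantic confusability between labels, in the spirit of the DOMINO entry above, is to soften training targets with a class-similarity matrix so that confusions between semantically close classes are penalized less. A generic sketch; the similarity matrix and mixing weight are assumptions, not DOMINO's published formulation:

```python
import torch
import torch.nn.functional as F

def similarity_smoothed_loss(logits, labels, class_sim, alpha=0.1):
    """Cross-entropy against targets mixed with rows of a class-similarity
    matrix, so semantically confusable classes share probability mass.
    class_sim: (C, C) non-negative similarities, rows summing to 1."""
    one_hot = F.one_hot(labels, num_classes=logits.size(-1)).float()
    targets = (1.0 - alpha) * one_hot + alpha * class_sim[labels]
    return -(targets * logits.log_softmax(dim=-1)).sum(dim=-1).mean()
```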