Mind the Gap: Benchmarking LLM Uncertainty, Discrimination, and Calibration in Specialty-Aware Clinical QA
- URL: http://arxiv.org/abs/2506.10769v2
- Date: Tue, 12 Aug 2025 15:15:15 GMT
- Title: Mind the Gap: Benchmarking LLM Uncertainty, Discrimination, and Calibration in Specialty-Aware Clinical QA
- Authors: Alberto Testoni, Iacer Calixto
- Abstract summary: We evaluate uncertainty estimation methods for clinical question answering (QA) across eleven clinical specialties and six question types. We present a case study introducing a novel lightweight method based on behavioral features derived from reasoning-oriented models. Our findings reveal that uncertainty reliability is not a monolithic property, but one that depends on clinical specialty and question type.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable uncertainty quantification (UQ) is essential when employing large language models (LLMs) in high-risk domains such as clinical question answering (QA). In this work, we evaluate uncertainty estimation methods for clinical QA focusing, for the first time, on eleven clinical specialties and six question types, and across ten open-source LLMs (general-purpose, biomedical, and reasoning models). We analyze score-based UQ methods, present a case study introducing a novel lightweight method based on behavioral features derived from reasoning-oriented models, and examine conformal prediction as a complementary set-based approach. Our findings reveal that uncertainty reliability is not a monolithic property, but one that depends on clinical specialty and question type due to shifts in calibration and discrimination. Our results highlight the need to select or ensemble models based on their distinct, complementary strengths and clinical use.
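The abstract contrasts score-based UQ methods with conformal prediction as a complementary set-based approach. As a point of reference, a minimal sketch of split conformal prediction for multiple-choice QA is shown below; the function name, array shapes, and scores are illustrative and not taken from the paper:

```python
import numpy as np

def conformal_qsets(cal_scores, cal_labels, test_scores, alpha=0.1):
    """Split conformal prediction for multiple-choice QA.

    cal_scores:  (n, k) softmax scores on a held-out calibration set
    cal_labels:  (n,) index of the correct option per calibration question
    test_scores: (m, k) softmax scores on new questions
    Returns a list of prediction sets with ~(1 - alpha) marginal coverage.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true answer.
    nonconf = 1.0 - cal_scores[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of calibration nonconformity
    # (clipped to 1.0 so small calibration sets do not overflow the quantile level).
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(nonconf, level, method="higher")
    # Keep every option whose nonconformity does not exceed the threshold.
    return [np.where(1.0 - row <= q)[0] for row in test_scores]
```

Under this scheme, larger prediction sets signal higher uncertainty, and set sizes can be compared across strata such as clinical specialty or question type, which is the kind of stratified analysis the abstract describes.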
Related papers
- Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification [60.18369393468405]
Existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. GLEAN compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset.
arXiv Detail & Related papers (2026-03-03T09:36:43Z) - An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification [11.640422721732756]
We empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification. We find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones.
arXiv Detail & Related papers (2026-03-03T08:16:44Z) - A systematic evaluation of uncertainty quantification techniques in deep learning: a case study in photoplethysmography signal analysis [1.6690512882610855]
Deep learning models can be used to continuously monitor physiological parameters outside of clinical settings. There is a risk of poor performance when deployed in practical measurement scenarios, leading to negative patient outcomes. Here we implement eight uncertainty quantification (UQ) techniques on models trained on two clinically relevant prediction tasks.
arXiv Detail & Related papers (2025-10-31T22:54:13Z) - A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist [1.1731001328350983]
This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset. We analyze both cognitive adaptation and calibration error using two metrics: Expected Calibration Error (ECE) and a baseline-normalized Relative Calibration Error (RCE). Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions.
arXiv Detail & Related papers (2025-10-22T00:15:02Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - Consistency of Feature Attribution in Deep Learning Architectures for Multi-Omics [0.36646002427839136]
We investigate the use of Shapley Additive Explanations (SHAP) on a multi-view deep learning model applied to multi-omics data. Rankings of features via SHAP are compared across various architectures to evaluate consistency of the method. We present an alternative, simple method to assess the robustness of identification of important biomolecules.
arXiv Detail & Related papers (2025-07-30T17:53:42Z) - Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs [3.299877799532224]
We propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers. We derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
arXiv Detail & Related papers (2025-06-17T14:01:39Z) - Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics [8.692647930497936]
We use conformal analysis to quantify the predictive uncertainty of a vision transformer based foundation model. We show how this can be used as a fairness metric to evaluate the robustness of the feature embeddings of the foundation model.
arXiv Detail & Related papers (2025-03-31T08:06:00Z) - Uncertainty-aware abstention in medical diagnosis based on medical texts [87.88110503208016]
This study addresses the critical issue of reliability for AI-assisted medical diagnosis. We focus on the selective prediction approach, which allows the diagnosis system to abstain from providing a decision if it is not confident in the diagnosis. We introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks.
arXiv Detail & Related papers (2025-02-25T10:15:21Z) - Stabilizing Machine Learning for Reproducible and Explainable Results: A Novel Validation Approach to Subject-Specific Insights [2.7516838144367735]
We propose a novel validation approach that uses a general ML model to ensure reproducible performance and robust feature importance analysis. We tested a single Random Forest (RF) model on nine datasets varying in domain, sample size, and demographics. Our repeated trials approach consistently identified key features at the subject level and improved group-level feature importance analysis.
arXiv Detail & Related papers (2024-12-16T23:14:26Z) - Predictive uncertainty estimation in deep learning for lung carcinoma classification in digital pathology under real dataset shifts [2.309018557701645]
This paper evaluates whether predictive uncertainty estimation adds robustness to deep learning-based diagnostic decision-making systems.
We first investigate three popular methods for improving predictive uncertainty: Monte Carlo dropout, deep ensemble, and few-shot learning on lung adenocarcinoma classification as a primary disease in whole slide images.
arXiv Detail & Related papers (2024-08-15T21:49:43Z) - Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
arXiv Detail & Related papers (2024-07-11T16:51:33Z) - Unified Uncertainty Estimation for Cognitive Diagnosis Models [70.46998436898205]
We propose a unified uncertainty estimation approach for a wide range of cognitive diagnosis models.
We decompose the uncertainty of diagnostic parameters into data aspect and model aspect.
Our method is effective and can provide useful insights into the uncertainty of cognitive diagnosis.
arXiv Detail & Related papers (2024-03-09T13:48:20Z) - Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond [52.246494389096654]
This paper introduces Word-Sequence Entropy (WSE), a method that calibrates uncertainty at both the word and sequence levels.
We compare WSE with six baseline methods on five free-form medical QA datasets, utilizing seven popular large language models (LLMs).
arXiv Detail & Related papers (2024-02-22T03:46:08Z) - The Relevance Feature and Vector Machine for health applications [0.11538034264098687]
This paper presents a novel model that addresses the challenges of the fat-data problem when dealing with clinical prospective studies.
The model capabilities are tested against state-of-the-art models in several medical datasets with fat-data problems.
arXiv Detail & Related papers (2024-02-11T01:21:56Z) - Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally-efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z) - In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z) - Towards Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty [57.023423137202485]
Concerns regarding the reliability of medical image segmentation persist among clinicians. We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks. By leveraging subjective logic theory, we explicitly model probability and uncertainty for medical image segmentation.
arXiv Detail & Related papers (2023-01-01T05:02:46Z) - Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z) - Adversarial Sample Enhanced Domain Adaptation: A Case Study on Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation.
Adversarially generated samples are used during domain adaptation.
Results confirm the effectiveness of our method and the generality on different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z) - Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.