Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
- URL: http://arxiv.org/abs/2510.11196v2
- Date: Sun, 09 Nov 2025 10:56:57 GMT
- Title: Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
- Authors: Johannes Moll, Markus Graf, Tristan Lemke, Nicolas Lenhart, Daniel Truhn, Jean-Benoit Delbrouck, Jiazhen Pan, Daniel Rueckert, Lisa C. Adams, Keno K. Bressem
- Abstract summary: Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications.
- Score: 19.488236277427358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment because they prioritize answer accuracy or adherence to output formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications along three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall's $\tau_b=0.670$), moderate alignment for fidelity ($\tau_b=0.387$), and weak alignment for confidence tone ($\tau_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality can be decoupled, that acknowledging injected cues does not ensure grounding, and that text cues shift explanations more than visual cues. While some open-source models match proprietary models on final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.
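The alignment figures above are Kendall's $\tau_b$, a rank correlation with a tie correction, computed between the automated evaluator's ratings and a radiologist's ratings of the same explanations. A minimal sketch of that computation with SciPy; the score arrays are hypothetical placeholders, not the study's data:

```python
# Minimal sketch: Kendall's tau-b agreement between an automated evaluator's
# ratings and a radiologist's ratings of the same CoT explanations.
# The arrays below are hypothetical placeholders, not data from the paper.
from scipy.stats import kendalltau

evaluator_scores = [3, 1, 4, 2, 5, 2, 4]    # e.g., per-case attribution ratings
radiologist_scores = [3, 2, 4, 1, 5, 2, 3]

# variant="b" applies the tie correction that defines tau_b.
tau_b, p_value = kendalltau(evaluator_scores, radiologist_scores, variant="b")
print(f"tau_b = {tau_b:.3f} (p = {p_value:.3f})")
```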
Related papers
- Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment [0.0]
Small open-source language models are gaining attention for healthcare applications in low-resource settings. We evaluate five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B) across three clinical question answering datasets.
arXiv Detail & Related papers (2026-03-01T04:37:48Z)
- Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning [10.264467364282865]
Similarity-as-Evidence (SaE) calibrates text-image similarities by introducing a Similarity Evidence Head (SEH). SaE attains state-of-the-art macro-averaged accuracy of 82.57% on medical imaging datasets with a 20% label budget.
arXiv Detail & Related papers (2026-02-21T15:21:54Z)
- Same Answer, Different Representations: Hidden instability in VLMs [65.36933543377346]
We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness. We apply this framework to modern Vision Language Models (VLMs) across the SEEDBench, MMMU, and POPE datasets.
arXiv Detail & Related papers (2026-02-06T12:24:26Z)
- Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes [79.36545159724703]
We propose Latent Representation Probing (LRP) to train lightweight probes on hidden states or attention patterns. LRP improves abstention accuracy by 7.6% over the best baselines. This establishes a principled framework for building deployment-ready AI systems.
arXiv Detail & Related papers (2025-11-25T00:24:42Z)
- VinDr-CXR-VQA: A Visual Question Answering Dataset for Explainable Chest X-Ray Analysis with Multi-Task Learning [3.4998703934432682]
VinDr-CXR-VQA is a large-scale chest X-ray dataset for explainable Medical Visual Question Answering (Med-VQA) with spatial grounding. The dataset contains 17,597 question-answer pairs across 4,394 images, each annotated with radiologist-verified bounding boxes and clinical reasoning explanations.
arXiv Detail & Related papers (2025-11-01T11:17:44Z)
- CLUE: Non-parametric Verification from Experience via Hidden-State Clustering [64.50919789875233]
We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates.
arXiv Detail & Related papers (2025-10-02T02:14:33Z)
- EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy, the tendency of models to uncritically echo user-provided information, and introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z)
- Evaluating Large Language Models for Evidence-Based Clinical Question Answering [4.101088122511548]
Large Language Models (LLMs) have demonstrated substantial progress in biomedical and clinical applications. We curate a benchmark drawing from Cochrane systematic reviews and clinical guidelines. We observe consistent performance patterns across sources and clinical domains.
arXiv Detail & Related papers (2025-09-13T15:03:34Z)
- Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift [12.373281238541296]
Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods.
arXiv Detail & Related papers (2025-09-11T12:26:57Z)
- mFARM: Towards Multi-Faceted Fairness Assessment based on HARMs in Clinical Decision Support [10.90604216960609]
The deployment of Large Language Models (LLMs) in high-stakes medical settings poses a critical AI alignment challenge. Existing fairness evaluation methods fall short in these contexts, as they typically use simplistic metrics that overlook the multi-dimensional nature of medical harms. We propose a multi-metric framework, Multi-faceted Fairness Assessment based on hARMs (mFARM), to audit fairness across three distinct dimensions of disparity. Our findings show that the proposed mFARM metrics capture subtle biases more effectively under various settings.
arXiv Detail & Related papers (2025-09-02T06:47:57Z)
- MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine [69.08855631283829]
We introduce MedOmni-45°, a benchmark designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal.
arXiv Detail & Related papers (2025-08-22T08:38:16Z)
- The Confidence Paradox: Can LLM Know When It's Wrong [5.545086863155316]
We introduce HonestVQA, a self-supervised honesty calibration framework for ethically aligned DocVQA. Our model-agnostic method quantifies uncertainty to identify knowledge gaps, aligns model confidence with actual correctness using weighted loss functions, and enforces ethical response behavior via contrastive learning. Empirically, HonestVQA improves DocVQA accuracy by up to 4.3% and F1 by 4.3% across the SpDocVQA, InfographicsVQA, and SROIE datasets.
arXiv Detail & Related papers (2025-06-30T02:06:54Z)
- Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations? [37.703287009808896]
Finetuning can cause spurious correlations to arise between non-essential features and the target labels. We develop the SpuriVerse benchmark by sourcing GPT-4o errors on real-world visual question answering (VQA) benchmarks. Evaluating 15 open- and closed-source LVLMs on SpuriVerse, we find that even state-of-the-art closed-source models struggle significantly.
arXiv Detail & Related papers (2025-06-23T06:11:43Z)
- Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue (a minimal sketch of this flip-rate measurement follows the list below).
arXiv Detail & Related papers (2024-11-05T01:11:28Z)
- Proximity-Informed Calibration for Deep Neural Networks [49.330703634912915]
ProCal is a plug-and-play algorithm with a theoretical guarantee to adjust sample confidence based on proximity.
We show that ProCal is effective in addressing proximity bias and improving calibration on balanced, long-tail, and distribution-shift settings.
arXiv Detail & Related papers (2023-06-07T16:40:51Z)
- VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives [84.48039784446166]
We show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason metrics.
Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets.
Predictions are more accurate when explanations are plausible and faithful, but not when they are plausible yet unfaithful.
arXiv Detail & Related papers (2022-06-22T17:02:01Z)
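Two entries above (EchoBench and the misleading-scenario study) quantify robustness as an answer flip rate: the fraction of initially correct answers a model overturns once a deceptive cue is injected. A minimal sketch of that metric, assuming a hypothetical `ask(question, cue)` model interface rather than any API from the listed papers:

```python
# Minimal sketch of the answer flip rate measured by sycophancy and
# misleading-cue benchmarks. `ask` is a hypothetical model interface
# (question, optional cue -> answer string), not an API from any listed paper.
from typing import Callable, Optional

def flip_rate(items: list[dict], ask: Callable[[str, Optional[str]], str]) -> float:
    """Fraction of initially correct answers that flip after a deceptive cue.

    Each item is a dict with "question", "answer" (gold), and "cue" keys.
    """
    # Keep only the items the model answers correctly without any cue.
    correct = [it for it in items if ask(it["question"], None) == it["answer"]]
    if not correct:
        return 0.0  # no initially correct answers to perturb
    # Re-ask with the deceptive cue and count overturned answers.
    flipped = sum(ask(it["question"], it["cue"]) != it["answer"] for it in correct)
    return flipped / len(correct)
```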