XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
- URL: http://arxiv.org/abs/2510.19599v1
- Date: Wed, 22 Oct 2025 13:52:19 GMT
- Title: XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
- Authors: Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Sebastian Otalora, Mauricio Reyes,
- Abstract summary: We present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data; (3) a model's overall recognition ability and grounding ability are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention
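The similarity-based localization and alignment evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes L2-normalizable patch embeddings from a CLIP-style image encoder and a single text embedding, computes a cosine-similarity map over patches, and scores it against a radiologist-annotated mask with the pointing game and thresholded IoU. All function names here are hypothetical.

```python
import numpy as np

def similarity_map(patch_embeds: np.ndarray, text_embed: np.ndarray) -> np.ndarray:
    """Cosine similarity between each image patch embedding (N, D)
    and a text embedding (D,), min-max normalized to [0, 1]."""
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=-1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    sims = p @ t
    return (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)

def pointing_game_hit(sim_map_2d: np.ndarray, gt_mask: np.ndarray) -> bool:
    """Hit if the map's argmax falls inside the annotated region."""
    idx = np.unravel_index(np.argmax(sim_map_2d), sim_map_2d.shape)
    return bool(gt_mask[idx])

def iou_at_threshold(sim_map_2d: np.ndarray, gt_mask: np.ndarray,
                     thresh: float = 0.5) -> float:
    """IoU between the thresholded map and the annotated region."""
    pred = sim_map_2d >= thresh
    union = np.logical_or(pred, gt_mask).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred, gt_mask).sum() / union
```

In practice the patch embeddings would come from the VLM's vision tower (e.g. the pre-pooling token grid of a CLIP image encoder), reshaped to the patch grid and upsampled to image resolution before comparison with pixel-level annotations.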
Related papers
- Multi-View Stenosis Classification Leveraging Transformer-Based Multiple-Instance Learning Using Real-World Clinical Data [76.89269238957593]
Coronary artery stenosis is a leading cause of cardiovascular disease, diagnosed by analyzing the coronary arteries from multiple angiography views. We propose SegmentMIL, a transformer-based multi-view multiple-instance learning framework for patient-level stenosis classification.
arXiv Detail & Related papers (2026-02-02T13:07:52Z) - Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation [25.148217482604746]
We propose VALOR: Visual Alignment of Medical Vision-Language Models for Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). Experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
arXiv Detail & Related papers (2025-12-18T05:48:21Z) - S-Chain: Structured Visual Chain-of-Thought For Medicine [81.97605645734741]
We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT). The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical vision-language models.
arXiv Detail & Related papers (2025-10-26T15:57:14Z) - Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z) - RAU: Reference-based Anatomical Understanding with Vision Language Models [26.06602931463068]
We introduce RAU, a framework for reference-based anatomical understanding with vision-language models (VLMs). We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2.
arXiv Detail & Related papers (2025-09-26T14:32:03Z) - Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology [0.0]
We analyse the limitations of BiomedCLIP when applied to a highly imbalanced, out-of-distribution medical dataset. We show that the model under zero-shot settings over-predicts all labels, leading to poor precision and inter-class separability. We highlight the need for careful adaptations of the models to foster reliability and applicability in a real-world setting.
arXiv Detail & Related papers (2025-06-17T02:59:42Z) - Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding [45.248265283405004]
Current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. We introduce Disease-Aware Prompting (DAP), which uses the explainability map of a VLM to identify the appropriate image features. DAP improves visual grounding accuracy by 20.74% compared to state-of-the-art methods across three major chest X-ray datasets.
arXiv Detail & Related papers (2025-05-21T05:16:45Z) - From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation [48.45209969191245]
Vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. We propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, improving 3-5% over gaze baselines without increasing the annotation burden.
arXiv Detail & Related papers (2025-04-15T16:32:15Z) - CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting [0.0]
We evaluate the publicly available, state of the art, foundational vision-language models for chest X-ray interpretation.
We find that vision-language models often hallucinate with confident language, which slows down clinical interpretation.
We develop an agent-based vision-language approach for report generation using CheXagent's linear probes and BioViL-T's phrase grounding tools.
arXiv Detail & Related papers (2024-07-11T18:39:19Z) - Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.