Related papers: DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?

URL: http://arxiv.org/abs/2505.24173v1
Date: Fri, 30 May 2025 03:33:25 GMT
Title: DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?
Authors: Tianhong Zhou, Yin Xu, Yingtao Zhu, Chuxi Xiao, Haiyang Bian, Lei Wei, Xuegong Zhang
Abstract summary: We propose DrVD-Bench, the first benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology.
Score: 1.1094764204428438
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) exhibit strong zero-shot generalization on natural images and show early promise in interpretable medical image analysis. However, existing benchmarks do not systematically evaluate whether these models truly reason like human clinicians or merely imitate superficial patterns. To address this gap, we propose DrVD-Bench, the first multimodal benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation, comprising a total of 7,789 image-question pairs. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology. DrVD-Bench is explicitly structured to reflect the clinical reasoning workflow from modality recognition to lesion identification and diagnosis. We benchmark 19 VLMs, including general-purpose and medical-specific, open-source and proprietary models, and observe that performance drops sharply as reasoning complexity increases. While some models begin to exhibit traces of human-like reasoning, they often still rely on shortcut correlations rather than grounded visual understanding. DrVD-Bench offers a rigorous and structured evaluation framework to guide the development of clinically trustworthy VLMs.

Related papers

Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z)
On the Risk of Misleading Reports: Diagnosing Textual Biases in Multimodal Clinical AI [4.866086225040713]
We introduce a perturbation-based approach to quantify a model's reliance on each modality in binary classification tasks. By swapping images or text between samples with opposing labels, we expose modality-specific biases.
arXiv Detail & Related papers (2025-07-31T21:35:52Z)
How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study [16.84832179579428]
Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, across eight benchmarks. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support.
arXiv Detail & Related papers (2025-07-15T11:12:39Z)
See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis [30.3617091206683]
Medical vision-language models (VLMs) focus primarily on single-image or single-series analyses. We show that providing general-purpose VLMs with query and normative matched reference images, accompanied by clinically-informed comparative prompts, significantly improves diagnostic outcomes.
arXiv Detail & Related papers (2025-06-22T18:59:44Z)
MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning [24.9872402922819]
Existing medical VQA benchmarks mostly focus on single-image analysis. We introduce MedFrameQA, the first benchmark that explicitly evaluates multi-image reasoning in medical VQA.
arXiv Detail & Related papers (2025-05-22T17:46:11Z)
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models [6.176432104264649]
Vision-language models (VLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored. We propose Med-R1, a reinforcement learning (RL)-enhanced vision-language model designed to improve generalization and reliability in medical reasoning. We evaluate Med-R1 across eight distinct medical imaging modalities.
arXiv Detail & Related papers (2025-03-18T06:12:38Z)
Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions. VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information. We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge [93.61262892578067]
Uncertainty in medical image segmentation tasks, especially inter-rater variability, presents a significant challenge. This variability directly impacts the development and evaluation of automated segmentation algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ).
arXiv Detail & Related papers (2024-03-19T17:57:24Z)
Multi-modal vision-language model for generalizable annotation-free pathology localization and clinical diagnosis [24.073780692427437]
Defining pathologies automatically from medical images aids the understanding of the emergence and progression of diseases. Existing deep learning models rely on expert annotations and lack generalization capabilities in open clinical environments. We present a vision-language model for annotation-free pathology localization (AFLoc). We conducted experiments on a dataset of 220K image-report pairs of chest X-rays, and performed extensive validation across 8 external datasets.
arXiv Detail & Related papers (2024-01-04T03:09:39Z)
Revamping AI Models in Dermatology: Overcoming Critical Challenges for Enhanced Skin Lesion Diagnosis [8.430482797862926]
We present an All-In-One Hierarchical-Out of Distribution-Clinical Triage model. For a clinical image, our model generates three outputs: a hierarchical prediction, an alert for out-of-distribution images, and a recommendation for dermoscopy. Our versatile model provides valuable decision support for lesion diagnosis and sets a promising precedent for medical AI applications.
arXiv Detail & Related papers (2023-11-02T06:08:49Z)
Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts. Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
Act Like a Radiologist: Towards Reliable Multi-view Correspondence Reasoning for Mammogram Mass Detection [49.14070210387509]
We propose an Anatomy-aware Graph convolutional Network (AGN) for mammogram mass detection. AGN is tailored for mammogram mass detection and endows existing detection methods with multi-view reasoning ability. Experiments on two standard benchmarks reveal that AGN significantly exceeds the state-of-the-art performance.
arXiv Detail & Related papers (2021-05-21T06:48:34Z)
Malignancy Prediction and Lesion Identification from Clinical Dermatological Images [65.1629311281062]
We consider machine-learning-based malignancy prediction and lesion identification from clinical dermatological images. Our approach first identifies all lesions present in the image regardless of sub-type or likelihood of malignancy, then estimates their likelihood of malignancy, and, through aggregation, generates an image-level likelihood of malignancy.
arXiv Detail & Related papers (2021-04-02T20:52:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.