Related papers: How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study

How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study

URL: http://arxiv.org/abs/2507.11200v2
Date: Fri, 18 Jul 2025 16:56:02 GMT
Title: How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study
Authors: Che Liu, Jiazhen Pan, Weixiang Shen, Wenjia Bai, Daniel Rueckert, Rossella Arcucci,
Abstract summary: Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare.<n>We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, across eight benchmarks.<n>First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images.<n>Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support.
Score: 16.84832179579428
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare; however, their competence in medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks: MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model performance across different aspects, we first separate it into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.

Related papers

Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
MedGemma Technical Report [75.88152277443179]
We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B.<n>MedGemma demonstrates advanced medical understanding and reasoning on images and text.<n>We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
arXiv Detail & Related papers (2025-07-07T17:01:44Z)
CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making [42.28216499263317]
We introduce Med-Zero-17K, a curated dataset for pure RL-based training, encompassing over 30 medical image modalities and 24 clinical tasks.<n>We propose a novel large-scale RL framework for Med-VLMs, which integrates rewards to ensure fidelity between perception and reasoning, consistency in reasoning-to-answer, and rule-based accuracy for final responses.
arXiv Detail & Related papers (2025-06-15T13:42:46Z)
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge.<n>We then introduce our medical-specialized MLLM: Lingshu.<n>Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? [1.1094764204428438]
We propose DrVD-Bench, the first benchmark for clinical visual reasoning.<n>DrVD-Bench consists of three modules: Visual Evidence, Reasoning Trajectory Assessment, and Report Generation Evaluation.<n>Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities-CT, MRI, ultrasound, radiography, and pathology.
arXiv Detail & Related papers (2025-05-30T03:33:25Z)
EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model [51.66031028717933]
Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare.<n>Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data; (ii) Benchmark; and (iii) Model.<n>We propose the Eyecare Kit, which tackles the aforementioned three key challenges with the tailored dataset, benchmark and model.
arXiv Detail & Related papers (2025-04-18T12:09:15Z)
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models [6.176432104264649]
Vision-language models (VLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored.<n>We propose Med-R1, a reinforcement learning (RL)-enhanced vision-language model designed to improve generalization and reliability in medical reasoning.<n>We evaluate Med-R1 across eight distinct medical imaging modalities.
arXiv Detail & Related papers (2025-03-18T06:12:38Z)
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning [34.93995619867384]
Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks.<n>We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning-scenarios.
arXiv Detail & Related papers (2025-03-10T15:38:44Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.<n>We propose a novel approach utilizing structured medical reasoning.<n>Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models [18.6994780408699]
Large Language Models (LLMs) face significant challenges in medical question answering.<n>We propose a novel approach incorporating similar case generation within a multi-agent medical question-answering system.<n>Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data.
arXiv Detail & Related papers (2024-12-31T19:55:45Z)
Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering [4.283761158899643]
We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA dataset with an overall accuracy of 87.5%.
arXiv Detail & Related papers (2024-04-24T20:31:15Z)
A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [57.88111980149541]
We introduce Asclepius, a novel Med-MLLM benchmark that assesses Med-MLLMs in terms of distinct medical specialties and different diagnostic capacities.<n>Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties.<n>We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.