Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images
- URL: http://arxiv.org/abs/2508.00549v2
- Date: Fri, 08 Aug 2025 12:17:05 GMT
- Title: Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images
- Authors: Daniel Wolf, Heiko Hillenhagen, Billurvan Taskin, Alex Bäuerle, Meinrad Beer, Michael Götz, Timo Ropinski
- Abstract summary: We evaluate the ability of state-of-the-art Vision-Language Models (VLMs) to accurately determine relative positions on medical images.
We investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance.
Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content.
- Score: 8.797134639962982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clinical decision-making relies heavily on understanding relative positions of anatomical structures and anomalies. Therefore, for Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions on medical images is a fundamental prerequisite. Despite its importance, this capability remains highly underexplored. To address this gap, we evaluate the ability of state-of-the-art VLMs (GPT-4o, Llama3.2, Pixtral, and JanusPro) and find that all models fail at this fundamental task. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. While these markers provide moderate improvements, results on medical images remain significantly lower than those reported on natural images. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content for answering relative position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the capability to identify relative positions in medical images.
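To make the evaluation setup concrete, here is a minimal sketch of the kind of visual prompting the abstract describes: alphanumeric markers drawn onto a scan, followed by a relative-position question for a VLM. The file name, marker coordinates, and question wording are illustrative assumptions, not the exact MIRP protocol.

```python
# Sketch: overlay alphanumeric markers on a scan for visual prompting.
from PIL import Image, ImageDraw

def add_marker(img, xy, label, radius=12):
    """Draw a filled red circle with a white alphanumeric label at pixel xy."""
    draw = ImageDraw.Draw(img)
    x, y = xy
    draw.ellipse((x - radius, y - radius, x + radius, y + radius),
                 fill=(255, 0, 0))
    draw.text((x - 4, y - 6), label, fill=(255, 255, 255))
    return img

img = Image.open("chest_xray.png").convert("RGB")  # hypothetical input scan
img = add_marker(img, (120, 200), "A")  # e.g., on the patient's right lung
img = add_marker(img, (380, 200), "B")  # e.g., on the patient's left lung
img.save("marked_xray.png")

# The marked image is then paired with a relative-position question and sent
# to the VLM under test (GPT-4o, Llama3.2, Pixtral, or JanusPro).
question = ("Is the structure marked A located to the patient's left or to "
            "the patient's right of the structure marked B?")
```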
Related papers
- Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis [2.243145970857166]
This study investigates the feature representations produced by publicly available open-source medical vision-language models (VLMs).
Our experiments showed that medical VLMs can extract discriminative features that are effective for medical classification tasks.
Our results imply that enhancing the text encoder is more crucial than training intensively on medical images when developing medical VLMs.
arXiv Detail & Related papers (2026-01-21T08:53:40Z)
- Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation [12.39187443971813]
We introduce Anatomy-VLM, a vision-language model that incorporates multi-scale information.
First, we design a model encoder to localize key anatomical features from entire medical images.
Second, these regions are enriched with structured knowledge for contextually aware interpretation.
Third, the model encoder aligns multi-scale medical information to generate clinically interpretable disease predictions.
arXiv Detail & Related papers (2025-11-11T16:18:01Z)
- Evaluating the Explainability of Vision Transformers in Medical Imaging [10.88831138993597]
This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies.
We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification.
Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets.
arXiv Detail & Related papers (2025-10-13T23:53:26Z)
- TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit.
We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
arXiv Detail & Related papers (2025-09-29T17:51:26Z)
- Does DINOv3 Set a New Medical Vision Standard? [67.33543059306938]
This report investigates whether DINOv3 can serve as a powerful unified encoder for medical vision tasks without domain-specific pre-training.
We benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation.
Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks.
arXiv Detail & Related papers (2025-09-08T09:28:57Z)
- How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study [16.84832179579428]
Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare.
We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs across eight benchmarks.
First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images.
Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support.
arXiv Detail & Related papers (2025-07-15T11:12:39Z)
- DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? [1.1094764204428438]
We propose DrVD-Bench, the first benchmark for clinical visual reasoning.
DrVD-Bench consists of three modules: Visual Evidence, Reasoning Trajectory Assessment, and Report Generation Evaluation.
Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology.
arXiv Detail & Related papers (2025-05-30T03:33:25Z)
- PRS-Med: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging [6.411386758550256]
PRS-Med is a framework that integrates vision-language models with segmentation capabilities to generate both accurate segmentation masks and corresponding spatial reasoning outputs.
The MMRS dataset provides diverse, spatially grounded question-answer pairs to address the lack of position reasoning data in medical imaging.
arXiv Detail & Related papers (2025-05-17T06:42:28Z)
- Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis [44.38638601819933]
Current staging models for Diabetic Retinopathy (DR) are hardly interpretable.
We present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis.
arXiv Detail & Related papers (2025-03-12T20:19:07Z)
- ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images [4.353855760968461]
The Cross-Modal Clinical Knowledge Distiller (ClinKD) is designed to enhance image-text alignment and establish more effective medical knowledge transfer mechanisms.
ClinKD achieves state-of-the-art performance on several datasets that are challenging for the Med-VQA task.
arXiv Detail & Related papers (2025-02-09T15:08:10Z)
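The summary above does not spell out ClinKD's cross-modal transfer mechanism; as background only, the snippet below shows the generic knowledge-distillation objective such distillers typically build on, with softened teacher targets blended into the usual cross-entropy loss. The temperature, weighting, and tensor shapes are illustrative assumptions, not ClinKD's actual configuration.

```python
# Generic knowledge-distillation loss (background sketch, not ClinKD itself).
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with softened teacher targets."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)  # standard T^2 scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: a batch of 8 examples over 10 answer classes.
loss = distill_loss(torch.randn(8, 10), torch.randn(8, 10),
                    torch.randint(0, 10, (8,)))
```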
- Fréchet Radiomic Distance (FRD): A Versatile Metric for Comparing Medical Imaging Datasets [13.737058479403311]
We introduce a new perceptual metric tailored for medical images, FRD (Fréchet Radiomic Distance).
We show that FRD is superior to other image distribution metrics for a range of medical imaging applications.
FRD offers additional benefits such as stability and computational efficiency at low sample sizes.
arXiv Detail & Related papers (2024-12-02T13:49:14Z)
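The radiomic feature extraction behind FRD is not reproduced here, but the Fréchet comparison it builds on can be sketched directly: fit a Gaussian to each feature set and compute the closed-form distance between the two, exactly as FID does for deep features. The function below assumes the radiomic features are already given as arrays.

```python
# Fréchet distance between two Gaussian-fitted feature sets (FID-style core).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """feats_a, feats_b: (n_samples, n_features) arrays of image features."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary
        covmean = covmean.real    # parts; discard them
    diff = mu_a - mu_b
    return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)
```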
- Visual Prompt Engineering for Vision Language Models in Radiology [0.17183214167143138]
Contrastive Language-Image Pretraining (CLIP) offers a promising solution by enabling zero-shot classification through multimodal large-scale pretraining.
While CLIP effectively captures global image content, radiology requires a more localized focus on specific pathology regions to enhance both interpretability and diagnostic accuracy.
We explore the potential of incorporating visual cues into zero-shot classification, embedding visual markers, such as arrows, bounding boxes, and circles, directly into radiological images to guide model attention.
arXiv Detail & Related papers (2024-08-28T13:53:27Z)
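As a rough illustration of the zero-shot setup this paper studies, the snippet below scores a radiograph that already carries a drawn visual marker against two candidate text prompts with OpenAI's open-source CLIP. The file name and class prompts are assumptions, not the paper's configuration.

```python
# Zero-shot scoring of a visually prompted radiograph with CLIP.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("xray_with_circle.png")).unsqueeze(0).to(device)
texts = clip.tokenize(["a chest x-ray showing pneumonia",
                       "a chest x-ray with no finding"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)
print(probs)  # the marker should shift weight toward the highlighted finding
```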
- A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis [48.84443450990355]
Deep networks have achieved broad success in analyzing natural images, but when applied to medical scans, they often fail in unexpected situations.
We investigate this challenge, focusing on model sensitivity to domain shifts, such as data sampled from different hospitals or data confounded by demographic variables such as sex and race, in the context of chest X-rays and skin lesion images.
Taking inspiration from medical training, we propose giving deep networks a prior grounded in explicit medical knowledge communicated in natural language.
arXiv Detail & Related papers (2024-05-23T17:55:02Z)
- Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG).
MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner.
We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)
- OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
GPT-4V excels in understanding medical images and generates high-quality radiology reports.
However, its performance on medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
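The concept-bottleneck recipe described above can be sketched in a few lines: image embeddings are projected onto a set of named clinical concept embeddings, and a simple linear classifier is trained on those concept scores alone, so every learned weight refers to a human-readable concept. The encoders are abstracted as random arrays here, and the concept list is illustrative rather than queried from GPT-4 as in the paper.

```python
# Concept-bottleneck sketch: classify from named concept scores, not raw features.
import numpy as np
from sklearn.linear_model import LogisticRegression

concepts = ["consolidation", "cardiomegaly", "pleural effusion"]  # illustrative

# Stand-ins for a vision-language model's aligned text/image embeddings.
concept_emb = np.random.randn(len(concepts), 512)  # (n_concepts, d)
image_emb = np.random.randn(200, 512)              # (n_images, d)
labels = np.random.randint(0, 2, size=200)         # toy binary labels

def concept_scores(img, con):
    """Cosine similarity of each image to each concept (the 'bottleneck')."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    con = con / np.linalg.norm(con, axis=1, keepdims=True)
    return img @ con.T                             # (n_images, n_concepts)

X = concept_scores(image_emb, concept_emb)
clf = LogisticRegression().fit(X, labels)
# Each entry of clf.coef_ now weights a named clinical concept, which is
# what makes the classifier's decisions inspectable.
```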
- Region-based Contrastive Pretraining for Medical Image Retrieval with Anatomic Query [56.54255735943497]
We introduce a novel region-based contrastive pretraining approach for medical image retrieval (RegionMIR).
arXiv Detail & Related papers (2023-05-09T16:46:33Z)
- An Interpretable Multiple-Instance Approach for the Detection of Referable Diabetic Retinopathy from Fundus Images [72.94446225783697]
We propose a machine learning system for the detection of referable Diabetic Retinopathy in fundus images.
By extracting local information from image patches and combining it efficiently through an attention mechanism, our system is able to achieve high classification accuracy.
We evaluate our approach on publicly available retinal image datasets, in which it exhibits near state-of-the-art performance.
arXiv Detail & Related papers (2021-03-02T13:14:15Z)
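The attention mechanism this paper uses to combine patch-level information can be sketched as a small pooling module: each patch embedding receives a learned weight, and the weighted sum forms the image-level representation from which the prediction is made. The dimensions and two-class head below are illustrative assumptions, not the paper's exact architecture.

```python
# Attention-based multiple-instance pooling over image patches (sketch).
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim=512, hidden=128, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(in_dim, n_classes)

    def forward(self, patches):          # patches: (n_patches, in_dim)
        weights = torch.softmax(self.attn(patches), dim=0)  # (n_patches, 1)
        bag = (weights * patches).sum(dim=0)                # (in_dim,)
        return self.head(bag), weights   # logits plus per-patch attention

model = AttentionMIL()
logits, attn = model(torch.randn(64, 512))  # 64 patches from one fundus image
# `attn` shows which patches drove the prediction, giving interpretability.
```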
- Explaining Clinical Decision Support Systems in Medical Imaging using Cycle-Consistent Activation Maximization [112.2628296775395]
Clinical decision support using deep neural networks has become a topic of steadily growing interest.
However, clinicians are often hesitant to adopt the technology because its underlying decision-making process is considered opaque and difficult to comprehend.
We propose a novel decision explanation scheme based on CycleGAN activation maximization, which generates high-quality visualizations of classifier decisions even on smaller datasets.
arXiv Detail & Related papers (2020-10-09T14:39:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.