MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine
- URL: http://arxiv.org/abs/2508.02951v1
- Date: Mon, 04 Aug 2025 23:19:18 GMT
- Title: MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine
- Authors: Mahtab Bigverdi, Wisdom Ikezogwo, Kevin Zhang, Hyewon Jeong, Mingyu Lu, Sungjae Cho, Linda Shapiro, Ranjay Krishna
- Abstract summary: We introduce MedBLINK, a benchmark designed to probe these models for such perceptual abilities.
MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images.
While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%.
- Score: 12.333678882957377
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical tasks. We introduce MedBLINK, a benchmark designed to probe these models for such perceptual abilities. MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general-purpose (GPT-4o, Claude 3.5 Sonnet) and domain-specific (Med-Flamingo, LLaVA-Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption. Data is available on our project page.
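As a concrete illustration, below is a minimal sketch of how a multiple-choice perception benchmark like MedBLINK can be scored. The record layout, `load_questions`, and the `query_model` callable are hypothetical stand-ins; the actual data format is defined by the project page and each model exposes its own API.

```python
# Minimal sketch of a MedBLINK-style multiple-choice evaluation loop.
# `load_questions` and `query_model` are hypothetical stand-ins: the real
# benchmark data format and model interfaces are defined elsewhere.
import json
import re

def load_questions(path: str) -> list[dict]:
    """Assumed record layout: {'image': ..., 'question': ..., 'options': [...], 'answer': 'B'}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def extract_choice(response: str, n_options: int):
    """Pull the first option letter (A, B, ...) out of a free-text model reply."""
    letters = "ABCDEFGH"[:n_options]
    match = re.search(rf"\b([{letters}])\b", response.upper())
    return match.group(1) if match else None

def evaluate(questions: list[dict], query_model) -> float:
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{letter}. {opt}" for letter, opt in zip("ABCDEFGH", q["options"])
        ) + "\nAnswer with a single letter."
        reply = query_model(image=q["image"], prompt=prompt)  # MLM call, assumed
        if extract_choice(reply, len(q["options"])) == q["answer"]:
            correct += 1
    return correct / len(questions)
```

Under a protocol of this shape, accuracy is directly comparable across models and against the 96.4% human baseline reported above.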
Related papers
- Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards [21.831262938278915]
We introduce Med-PRM, a process reward modeling framework to verify each reasoning step against established medical knowledge bases.
Med-PRM achieves state-of-the-art performance, improving the performance of base models by up to 13.50%.
We demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat.
arXiv Detail & Related papers (2025-06-13T05:36:30Z)
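As a sketch of the general process-reward pattern that Med-PRM instantiates (not its actual implementation), a verifier scores each reasoning step in context and the aggregated reward reranks sampled chains. Here `score_step` is a hypothetical stand-in for a learned reward model grounded in medical knowledge bases, and min-aggregation is one common choice, assumed for illustration.

```python
# Generic sketch of stepwise process-reward scoring in the spirit of Med-PRM.
# `score_step` is a hypothetical stand-in for a learned reward model that
# verifies one reasoning step against retrieved medical guideline passages.
def score_chain(question: str, steps: list[str], score_step) -> float:
    """Min-aggregation (one common choice): a single unsupported step
    sinks the whole chain."""
    context = question
    step_scores = []
    for step in steps:
        step_scores.append(score_step(context, step))  # assumed to return [0, 1]
        context += "\n" + step  # later steps are judged given earlier ones
    return min(step_scores) if step_scores else 0.0

def best_of_n(question: str, candidate_chains: list[list[str]], score_step) -> list[str]:
    """Rerank N sampled reasoning chains by process reward; this is the
    plug-and-play use with a policy model such as Meerkat."""
    return max(candidate_chains, key=lambda s: score_chain(question, s, score_step))
```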
- Medical Large Vision Language Models with Multi-Image Visual Ability [46.889345205047675]
We present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs.
We fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis.
We also develop the Med-MIM benchmark to evaluate the medical multi-image understanding capabilities of LVLMs.
arXiv Detail & Related papers (2025-05-25T08:31:22Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
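The paper's exact reasoning schema is not given in the summary above; the sketch below shows the general pattern of constraining a general-purpose LLM to structured output with a validated JSON schema. The field names are illustrative assumptions, not the schema used in the paper.

```python
# One way to elicit structured medical reasoning: constrain the model to a
# fixed JSON schema and validate the reply. The fields below are illustrative
# assumptions, not the paper's actual schema.
import json
from pydantic import BaseModel, ValidationError

class StructuredAnswer(BaseModel):
    findings: list[str]        # salient clinical observations
    differential: list[str]    # candidate diagnoses, most likely first
    final_answer: str          # single committed answer
    confidence: float          # self-reported, in [0, 1]

PROMPT_SUFFIX = (
    "Respond ONLY with JSON matching: "
    '{"findings": [...], "differential": [...], '
    '"final_answer": "...", "confidence": 0.0}'
)

def parse_reply(reply: str):
    """Return a validated StructuredAnswer, or None so the caller can re-prompt."""
    try:
        return StructuredAnswer(**json.loads(reply))
    except (json.JSONDecodeError, ValidationError):
        return None
```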
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
We also develop MedS-Ins, an instruction-tuning dataset comprising 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
LLaVA-Rad inference is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
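CheXprompt is described above only as a GPT-4-based factuality metric; the sketch below shows the general LLM-as-judge pattern it instantiates, with an invented placeholder prompt and rubric rather than the actual CheXprompt protocol. The judge call uses the public openai Python client.

```python
# Rough sketch of the LLM-as-judge pattern behind metrics like CheXprompt:
# ask a strong LLM to count factual errors in a generated report against a
# reference. The prompt and rubric are illustrative placeholders, not the
# actual CheXprompt protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a radiologist grading a generated report.
Reference report:
{reference}

Generated report:
{candidate}

Count the clinically significant factual errors in the generated report
(hallucinated findings, missed findings, wrong laterality or severity).
Reply with a single integer."""

def count_errors(reference: str, candidate: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # any strong judge model; the paper uses GPT-4
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```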
- One-shot Localization and Segmentation of Medical Images with Foundation Models [7.9060536840474365]
We show that models trained on natural images can offer good performance on medical images.
We leverage the correspondence with respect to a template image to prompt a Segment Anything (SAM) model to arrive at single shot segmentation.
We also show that our single-shot method outperforms the recently proposed few-shot segmentation method, UniverSeg.
arXiv Detail & Related papers (2023-10-28T08:58:20Z)
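The one-shot recipe above lends itself to a short sketch: transfer an annotated point from the template image to the target via dense correspondence, then hand it to SAM as a point prompt. `match_template_point` is a hypothetical stand-in for the correspondence step (the paper derives it from foundation-model features); the SAM calls follow the public segment-anything API.

```python
# Sketch of one-shot segmentation by prompting SAM with a point transferred
# from an annotated template image. `match_template_point` is a hypothetical
# stand-in for the dense-correspondence step.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def segment_one_shot(template_img, template_point, target_img, match_template_point):
    # 1. Transfer the annotated landmark from the template to the target image.
    x, y = match_template_point(template_img, template_point, target_img)

    # 2. Use the transferred point as a positive prompt for SAM.
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # downloaded SAM weights
    predictor = SamPredictor(sam)
    predictor.set_image(target_img)  # HxWx3 uint8 RGB array
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),  # 1 = foreground point
        multimask_output=True,
    )
    return masks[np.argmax(scores)]  # keep SAM's highest-confidence mask
```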
- A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics [63.106382317917344]
We report a Transformer-based representation-learning model as a clinical diagnostic aid that processes multimodal input in a unified manner.
The unified model outperformed an image-only model and non-unified multimodal diagnosis models in the identification of pulmonary diseases.
arXiv Detail & Related papers (2023-06-01T16:23:47Z)
- Segment Anything in Medical Images [21.43661408153244]
We present MedSAM, a foundation model designed for enabling universal medical image segmentation.
The model is developed on a large-scale medical image dataset with 1,570,263 image-mask pairs, covering 10 imaging modalities and over 30 cancer types.
arXiv Detail & Related papers (2023-04-24T17:56:12Z)
- Ambiguous Medical Image Segmentation using Diffusion Models [60.378180265885945]
We introduce a single diffusion model-based approach that produces multiple plausible outputs by learning a distribution over group insights.
Our proposed model generates a distribution of segmentation masks by leveraging the inherent sampling process of diffusion.
Comprehensive results show that our proposed approach outperforms existing state-of-the-art ambiguous segmentation networks.
arXiv Detail & Related papers (2023-04-10T17:58:22Z)
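As a generic sketch of the idea above (not the paper's specific architecture), repeated stochastic sampling from an image-conditioned diffusion model yields a set of masks whose agreement can be summarized per pixel; `sample_mask` stands in for one full reverse-diffusion pass.

```python
# Generic sketch of using a diffusion model's stochastic sampling to expose
# segmentation ambiguity: draw several masks and summarize their agreement.
# `sample_mask` is a stand-in for a reverse-diffusion pass conditioned on
# the image; each call starts from fresh Gaussian noise.
import numpy as np

def sample_distribution(image, sample_mask, n_samples: int = 16):
    masks = np.stack([sample_mask(image) for _ in range(n_samples)])  # (N, H, W) binary
    mean = masks.mean(axis=0)           # per-pixel foreground frequency
    uncertainty = mean * (1.0 - mean)   # peaks where samples disagree
    consensus = mean > 0.5              # majority-vote mask
    return consensus, uncertainty
```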
- Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)