TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
- URL: http://arxiv.org/abs/2509.25143v1
- Date: Mon, 29 Sep 2025 17:51:26 GMT
- Title: TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
- Authors: Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu, Nanyun Peng
- Abstract summary: Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit. We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
- Score: 54.48710348910535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks - visual question-answering (VQA), report generation, and image-pair selection - and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we evaluate six proprietary and six open-source LVLMs. Our results show that most LVLMs lack the ability to analyze patients' condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, o3, o4-mini, and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. Furthermore, we explore augmenting the input with both retrieved visual and textual modalities in the medical domain. Multi-modal retrieval augmentation yields notably larger performance gains than either no retrieval or textual retrieval alone across most models on our benchmark, with an average improvement of 2.59% on the VQA task. Overall, we compose a benchmark grounded in real-world clinical practice; it reveals LVLMs' limitations in temporal medical image reasoning and highlights multi-modal retrieval augmentation as a promising direction for addressing this challenge.
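The abstract describes the three tasks and the retrieval-augmented setting only at a high level. The sketch below is a minimal illustration of how a temporal image-pair VQA item and its multi-modal retrieval-augmented query might be represented and scored; all field names, the prompt wording, and the exact-match scoring rule are assumptions made for illustration, not the authors' released data format or evaluation code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class TemporalVQAItem:
    """One hypothetical VQA item: an image pair from two visits plus a yes/no question.

    Field names are assumptions for illustration, not the benchmark's released schema.
    """
    prior_image: str    # path to the historical-visit image
    current_image: str  # path to the current-visit image
    question: str       # e.g. "Has the pleural effusion improved since the last visit?"
    answer: str         # gold label, e.g. "yes" or "no"
    retrieved_images: List[str] = field(default_factory=list)   # visual retrieval augmentation
    retrieved_reports: List[str] = field(default_factory=list)  # textual retrieval augmentation

def build_query(item: TemporalVQAItem, use_retrieval: bool = True) -> Dict[str, object]:
    """Assemble the model input: both visit images, the question, and optional retrieved context."""
    images = [item.prior_image, item.current_image]
    header = "Compare the patient's prior and current images and answer the question.\n"
    if use_retrieval:
        images += item.retrieved_images
        if item.retrieved_reports:
            header += "Retrieved context:\n" + "\n".join(item.retrieved_reports) + "\n"
    prompt = header + f"Question: {item.question}\nAnswer yes or no."
    return {"images": images, "prompt": prompt}

def vqa_accuracy(model: Callable[[Dict[str, object]], str],
                 items: List[TemporalVQAItem],
                 use_retrieval: bool = False) -> float:
    """Score any callable that maps a query dict to an answer string, by exact match."""
    correct = sum(
        model(build_query(it, use_retrieval)).strip().lower() == it.answer.lower()
        for it in items
    )
    return correct / len(items)

# Toy usage with a stand-in "model" that always answers "no":
if __name__ == "__main__":
    demo = [TemporalVQAItem("visit1.png", "visit2.png",
                            "Has the opacity in the right lower lobe worsened?", "no")]
    print(vqa_accuracy(lambda q: "no", demo))  # prints 1.0 on this single toy item
```

In this sketch, the closed-book setting simply drops the retrieved images and reports, mirroring the abstract's comparison between no retrieval, textual retrieval, and multi-modal retrieval.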
Related papers
- MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z) - How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study [16.84832179579428]
Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs across eight benchmarks. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support.
arXiv Detail & Related papers (2025-07-15T11:12:39Z) - DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? [1.1094764204428438]
We propose DrVD-Bench, the first benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence, Reasoning Trajectory Assessment, and Report Generation Evaluation. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology.
arXiv Detail & Related papers (2025-05-30T03:33:25Z) - Medical Large Vision Language Models with Multi-Image Visual Ability [46.889345205047675]
We present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs. We fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis. We also develop the Med-MIM benchmark to evaluate the medical multi-image understanding capabilities of LVLMs.
arXiv Detail & Related papers (2025-05-25T08:31:22Z) - ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue [25.398370966763597]
In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition.
Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients' mobile phones.
We propose ZALM3, a Zero-shot strategy to improve vision-language alignment in Multi-turn Multimodal Medical dialogue.
arXiv Detail & Related papers (2024-09-26T07:55:57Z) - FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging [68.6715007665896]
FedMedICL is a unified framework and benchmark to holistically evaluate federated medical imaging challenges.
We comprehensively evaluate several popular methods on six diverse medical imaging datasets.
We find that a simple batch balancing technique surpasses advanced methods in average performance across FedMedICL experiments.
arXiv Detail & Related papers (2024-07-11T19:12:23Z) - OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z) - TransMed: Large Language Models Enhance Vision Transformer for Biomedical Image Classification [11.202967500669402]
Few-shot learning has been studied to adapt models to tasks with very few samples.
We propose a novel approach that contextualizes labels via large language models (LLMs).
Our findings reveal that the context generated by LLMs significantly enhances the discrimination of semantic embeddings for similar categories.
arXiv Detail & Related papers (2023-12-12T09:58:07Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)