See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis
- URL: http://arxiv.org/abs/2506.18140v1
- Date: Sun, 22 Jun 2025 18:59:44 GMT
- Title: See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis
- Authors: Ruinan Jin, Gexin Huang, Xinwei Shen, Qiong Zhang, Yan Shuo Tan, Xiaoxiao Li
- Abstract summary: Medical vision-language models (VLMs) focus primarily on single-image or single-series analyses. We show that providing general-purpose VLMs with query and normative matched reference images, accompanied by clinically-informed comparative prompts, significantly improves diagnostic outcomes.
- Score: 30.3617091206683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical imaging diagnosis presents inherent challenges due to diseases that mimic normal anatomy and exhibit significant inter-patient variability. Clinicians routinely employ comparative reasoning, using reference images from healthy controls or previous patient examinations, to discern subtle yet diagnostically critical abnormalities. However, existing medical vision-language models (VLMs) focus primarily on single-image or single-series analyses and lack explicit mechanisms for comparative reasoning. Conversely, general-purpose VLMs demonstrate strong multi-image comparative reasoning capabilities but lack the essential medical-domain knowledge to identify nuanced clinical differences. This work aims to bridge this gap by exploring clinically-inspired comparative analysis within VLMs, leveraging reference images to enhance diagnostic accuracy. Through extensive empirical analysis, we show that providing general-purpose VLMs with query and normative matched reference images, accompanied by clinically-informed comparative prompts, significantly improves diagnostic outcomes compared to single-image baselines, especially after supervised finetuning (SFT). Our contributions highlight the clinical relevance of comparative analysis, introduce novel strategies for leveraging reference images in VLMs, empirically demonstrate enhanced performance across multiple medical visual question answering (VQA) tasks, and provide theoretical insights into the efficacy of comparative image analysis in medical diagnosis.
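The core recipe is simple to express in code. Below is a minimal sketch, not the authors' released implementation: it pairs a normative reference image with the query image in a single multi-image request and asks a clinically-informed comparative question. The OpenAI-style message schema, model name, and prompt wording are all illustrative assumptions.

```python
# Sketch of reference-guided comparative prompting for a general-purpose VLM.
# The message schema follows the widely used OpenAI-style multi-image chat
# format; the model name and prompt text are illustrative assumptions.
import base64

def encode_image(path: str) -> str:
    """Base64-encode an image file for inline transmission."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_comparative_request(query_path: str, reference_path: str) -> dict:
    """Assemble a two-image request: reference (healthy control) first, query second."""
    comparative_prompt = (
        "Image 1 is a normative reference from a healthy subject. "
        "Image 2 is the patient's query image. Compare the two images, "
        "describe any region where the query deviates from the reference, "
        "and state the most likely diagnosis."
    )
    return {
        "model": "gpt-4o",  # assumption: any multi-image-capable VLM would do
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": comparative_prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(reference_path)}"}},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(query_path)}"}},
            ],
        }],
    }
```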
Related papers
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - Test-Time-Scaling for Zero-Shot Diagnosis with Visual-Language Reasoning [37.37330596550283]
We introduce a framework for reliable medical image diagnosis using vision-language models. A test-time scaling strategy consolidates multiple candidate outputs into a reliable final diagnosis. We evaluate our approach across various medical imaging modalities.
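The abstract does not name the consolidation rule, but a common test-time scaling recipe is self-consistency voting: sample several candidate answers and keep the modal one. A minimal sketch under that assumption:

```python
# Consolidate multiple sampled diagnoses by majority vote (self-consistency).
# The voting rule is an assumption; the paper's exact strategy may differ.
from collections import Counter
from typing import Callable

def consolidate_diagnosis(sample_fn: Callable[[], str], n_samples: int = 8) -> tuple[str, float]:
    """Draw n_samples candidate diagnoses; return the modal answer and its vote share."""
    votes = Counter(sample_fn() for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples

# Usage: sample_fn would call a VLM with temperature > 0 on the same image/question,
# e.g. consolidate_diagnosis(lambda: vlm_answer(image, question))
```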
arXiv Detail & Related papers (2025-06-11T22:23:38Z) - DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? [1.1094764204428438]
We propose DrVD-Bench, the first benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence, Reasoning Trajectory Assessment, and Report Generation Evaluation. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology.
arXiv Detail & Related papers (2025-05-30T03:33:25Z) - Shifts in Doctors' Eye Movements Between Real and AI-Generated Medical Images [5.969442345531191]
Eye-tracking analysis plays a vital role in medical imaging, providing key insights into how radiologists visually interpret and diagnose clinical cases. We first analyze radiologists' attention and agreement by measuring the distribution of various eye-movement patterns, including saccade direction, amplitude, and their joint distribution. We then investigate whether and how doctors' gaze behavior shifts when viewing authentic (Real) versus deep-learning-generated (Fake) images.
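As a rough illustration of the statistics involved, the sketch below computes saccade amplitudes and a direction histogram from a sequence of fixation points; the input format and binning are assumptions, not the paper's protocol.

```python
# Compute saccade amplitude and direction distributions from fixation points.
# Input format (pixel coordinates) and bin count are assumptions.
import numpy as np

def saccade_stats(fixations: np.ndarray, n_direction_bins: int = 8):
    """fixations: (N, 2) array of successive fixation coordinates in pixels."""
    deltas = np.diff(fixations, axis=0)                   # saccade vectors
    amplitudes = np.linalg.norm(deltas, axis=1)           # saccade lengths
    directions = np.arctan2(deltas[:, 1], deltas[:, 0])   # angles in radians
    dir_hist, _ = np.histogram(directions, bins=n_direction_bins, range=(-np.pi, np.pi))
    return amplitudes, dir_hist / dir_hist.sum()
```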
arXiv Detail & Related papers (2025-04-21T10:13:59Z) - A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images [11.761590928900358]
In ophthalmology, multimodal large language models (MLLMs) have been explored for analyzing optical coherence tomography (OCT) reports. Our dataset consists of 439 fundus images and 75 OCT images. Using a standardized API-based framework, we assessed seven mainstream MLLMs and observed significant variability in diagnostic accuracy across different diseases.
arXiv Detail & Related papers (2025-03-10T09:19:55Z) - A Clinical-oriented Multi-level Contrastive Learning Method for Disease Diagnosis in Low-quality Medical Images [4.576524795036682]
Disease diagnosis methods guided by contrastive learning (CL) have shown significant advantages in lesion feature representation.
We propose a clinical-oriented multi-level CL framework that aims to enhance the model's capacity to extract lesion features.
The proposed CL framework is validated on two public medical image datasets, EyeQ and Chest X-ray.
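The abstract does not spell out the multi-level losses, but frameworks like this build on a standard contrastive objective. A minimal InfoNCE sketch for reference; the paper's specific multi-level design (which feature levels, how pairs are formed) is not shown here:

```python
# Standard InfoNCE contrastive loss over paired embeddings; the generic
# building block only, not the paper's multi-level formulation.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (B, D) embeddings of two views; row i of z1 matches row i of z2."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)      # diagonal entries are positives
```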
arXiv Detail & Related papers (2024-04-07T09:08:14Z) - C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose a cross-modal consistent multi-view medical report generation framework with a domain transfer network (C^2M-DoT).
C2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z) - A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics [63.106382317917344]
We report a Transformer-based representation-learning model as a clinical diagnostic aid that processes multimodal input in a unified manner.
The unified model outperformed an image-only model and non-unified multimodal diagnosis models in the identification of pulmonary diseases.
arXiv Detail & Related papers (2023-06-01T16:23:47Z) - Act Like a Radiologist: Towards Reliable Multi-view Correspondence Reasoning for Mammogram Mass Detection [49.14070210387509]
We propose an Anatomy-aware Graph convolutional Network (AGN) for mammogram mass detection.
AGN is tailored for mammogram mass detection and endows existing detection methods with multi-view reasoning ability.
Experiments on two standard benchmarks reveal that AGN significantly exceeds the state-of-the-art performance.
arXiv Detail & Related papers (2021-05-21T06:48:34Z) - Malignancy Prediction and Lesion Identification from Clinical Dermatological Images [65.1629311281062]
We consider machine-learning-based malignancy prediction and lesion identification from clinical dermatological images.
The method first identifies all lesions present in the image regardless of sub-type or likelihood of malignancy, then estimates their likelihood of malignancy, and, through aggregation, generates an image-level likelihood of malignancy.
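The aggregation rule is not specified in the abstract; one natural assumed choice is noisy-OR, under which the image is malignant if any single lesion is. A minimal sketch:

```python
# Noisy-OR aggregation of per-lesion malignancy probabilities into an
# image-level probability; the specific rule is an assumption.
import numpy as np

def image_level_malignancy(lesion_probs: np.ndarray) -> float:
    """lesion_probs: per-lesion malignancy probabilities in [0, 1]."""
    return float(1.0 - np.prod(1.0 - lesion_probs))

# e.g. image_level_malignancy(np.array([0.1, 0.05, 0.6])) -> ~0.658
```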
arXiv Detail & Related papers (2021-04-02T20:52:05Z) - Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits unlabeled data by encouraging prediction consistency for a given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
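The relation-driven details are beyond the abstract, but the underlying consistency idea is compact: predictions for an input and a perturbed copy of it should agree. A simplified sketch with plain Gaussian perturbation and MSE consistency, both assumptions:

```python
# Consistency regularization on unlabeled inputs: penalize disagreement
# between predictions on x and a perturbed copy of x. The perturbation and
# loss choices here are simplifying assumptions.
import torch
import torch.nn.functional as F

def consistency_loss(model: torch.nn.Module, x: torch.Tensor, noise_std: float = 0.05) -> torch.Tensor:
    """Penalize disagreement between predictions on x and a noisy perturbation of x."""
    with torch.no_grad():
        target = F.softmax(model(x), dim=1)          # "teacher" prediction, no gradient
    perturbed = x + noise_std * torch.randn_like(x)  # simple Gaussian perturbation
    student = F.softmax(model(perturbed), dim=1)
    return F.mse_loss(student, target)
```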
arXiv Detail & Related papers (2020-05-15T06:57:54Z) - Modeling and Enhancing Low-quality Retinal Fundus Images [167.02325845822276]
Low-quality fundus images increase uncertainty in clinical observation and lead to the risk of misdiagnosis.
We propose a clinically oriented fundus enhancement network (cofe-Net) to suppress global degradation factors.
Experiments on both synthetic and real images demonstrate that our algorithm effectively corrects low-quality fundus images without losing retinal details.
arXiv Detail & Related papers (2020-05-12T08:01:16Z)