Related papers: Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for Medical Diagnosis

Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for Medical Diagnosis

URL: http://arxiv.org/abs/2508.17394v2
Date: Sun, 31 Aug 2025 20:43:36 GMT
Title: Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for Medical Diagnosis
Authors: Nir Mazor, Tom Hope,
Abstract summary: We develop a model in which a multimodal retriever is jointly optimized with an LVLM for medical diagnosis.<n>We show that using only general-purpose backbones, with only lightweight fine-tuning, our model is able to achieve competitive results.
Score: 9.248806116103605
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Clinical decision-making often involves interpreting images (e.g., radiology) for making diagnoses. Retrieving relevant visual information from medical literature and hospital records could enhance diagnostic accuracy. In this paper, we develop a model in which a multimodal retriever is jointly optimized with an LVLM for medical diagnosis, unlike standard RAG where LVLM error signal is not propagated down to the retriever. We show that using only general-purpose backbones, with only lightweight fine-tuning, our model is able to achieve competitive results with medically-pretrained models across clinical multi-label classification and visual question answering tasks. In a novel analysis, we additionally find that in many cases different top retrieved images each lead to different predictions for a given target, and that these cases are empirically challenging for all models, even for non-retrieval models. Our joint retrieval optimization significantly improves these challenging cases over standard RAG. However, oracle analysis reveals that while the correct diagnosis is frequently achievable using one of the top retrieved images, in practice there is a large performance gap from the oracle, and rerankers using frontier LVLMs do not close this gap -- leaving ample room for improvement by future methods. Code will be made publicly available.

Related papers

Benchmarking Direct Preference Optimization for Medical Large Vision-Language Models [19.558012394552954]
We present the first comprehensive examination of diverse DPO variants within the medical domain.<n>Current DPO approaches often yield inconsistent gains over supervised fine-tuning.<n>They frequently fail to resolve fundamental visual misinterpretation errors.
arXiv Detail & Related papers (2026-01-25T17:36:53Z)
MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA)<n>We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context.<n>We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification [15.98427699337596]
We perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification.<n>We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques.
arXiv Detail & Related papers (2025-01-31T12:23:50Z)
HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation [89.3260120072177]
We propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for Radiology report generation.<n>Our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression.<n> Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models.
arXiv Detail & Related papers (2024-12-15T06:04:16Z)
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
Deep Learning with HM-VGG: AI Strategies for Multi-modal Image Analysis [10.01246918773756]
This study introduces the Hybrid Multi-modal VGG model, a cutting-edge deep learning approach for the early diagnosis of glaucoma. The model's performance is underscored by its high metrics in Precision, Accuracy, and F1-Score. The HM-VGG model offers a promising tool for doctors, streamlining the diagnostic process and improving patient outcomes.
arXiv Detail & Related papers (2024-10-31T15:42:24Z)
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.<n>Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.<n>We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z)
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models [35.60385437194243]
Current Medical Large Vision Language Models (Med-LVLMs) frequently encounter factual issues. RAG, which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. We propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the selection of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model.
arXiv Detail & Related papers (2024-07-06T16:45:07Z)
Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification [41.16626194300303]
Foundation models, often pre-trained with large-scale data, have achieved paramount success in jump-starting various vision and language applications. Recent advances further enable adapting foundation models in downstream tasks efficiently using only a few training samples. Yet, the application of such learning paradigms in medical image analysis remains scarce due to the shortage of publicly accessible data and benchmarks.
arXiv Detail & Related papers (2023-06-16T01:46:07Z)
Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space. We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
Modality Completion via Gaussian Process Prior Variational Autoencoders for Multi-Modal Glioma Segmentation [75.58395328700821]
We propose a novel model, Multi-modal Gaussian Process Prior Variational Autoencoder (MGP-VAE), to impute one or more missing sub-modalities for a patient scan. MGP-VAE can leverage the Gaussian Process (GP) prior on the Variational Autoencoder (VAE) to utilize the subjects/patients and sub-modalities correlations. We show the applicability of MGP-VAE on brain tumor segmentation where either, two, or three of four sub-modalities may be missing.
arXiv Detail & Related papers (2021-07-07T19:06:34Z)
Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for Thoracic Disease Identification [83.6017225363714]
deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance. For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming. In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.