HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks
- URL: http://arxiv.org/abs/2508.12778v1
- Date: Mon, 18 Aug 2025 09:54:10 GMT
- Title: HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks
- Authors: Zhe Chen, Yusheng Liao, Shuyang Jiang, Zhiyuan Zhu, Haolin Li, Yanfeng Wang, Yu Wang,
- Abstract summary: We present HeteroRAG, a novel framework that enhances Med-LVLMs through heterogeneous knowledge sources.<n>HeteroRAG achieves state-of-the-art performance in most medical vision language benchmarks.
- Score: 22.597677744620295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical large vision-language Models (Med-LVLMs) have shown promise in clinical applications but suffer from factual inaccuracies and unreliable outputs, posing risks in real-world diagnostics. While retrieval-augmented generation has emerged as a potential solution, current medical multimodal RAG systems are unable to perform effective retrieval across heterogeneous sources. The irrelevance of retrieved reports affects the factuality of analysis, while insufficient knowledge affects the credibility of clinical decision-making. To bridge the gap, we construct MedAtlas, which includes extensive multimodal report repositories and diverse text corpora. Based on it, we present HeteroRAG, a novel framework that enhances Med-LVLMs through heterogeneous knowledge sources. The framework introduces Modality-specific CLIPs for effective report retrieval and a Multi-corpora Query Generator for dynamically constructing queries for diverse corpora. Incorporating knowledge from such multifaceted sources, Med-LVLM is then trained with Heterogeneous Knowledge Preference Tuning to achieve cross-modality and multi-source knowledge alignment. Extensive experiments across 12 datasets and 3 modalities demonstrate that the proposed HeteroRAG achieves state-of-the-art performance in most medical vision language benchmarks, significantly improving factual accuracy and reliability of Med-LVLMs.
Related papers
- MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs [21.189398460029008]
MedXIAOHE is a medical vision-language foundation model designed to advance medical understanding and reasoning in real-world clinical applications.<n>We propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage.<n>For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training.
arXiv Detail & Related papers (2026-02-13T08:19:38Z) - MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA)<n>We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context.<n>We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z) - Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models [15.530083855947987]
We propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR.<n>Med-RwR actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning.<n> Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models.
arXiv Detail & Related papers (2025-10-21T05:18:18Z) - MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration [57.98393950821579]
We introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM)<n>Inspired by our empirical findings, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director.<n>This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases.
arXiv Detail & Related papers (2025-06-24T17:52:43Z) - A Multimodal Multi-Agent Framework for Radiology Report Generation [2.1477122604204433]
Radiology report generation (RRG) aims to automatically produce diagnostic reports from medical images.<n>We propose a multimodal multi-agent framework for RRG that aligns with the stepwise clinical reasoning workflow.
arXiv Detail & Related papers (2025-05-14T20:28:04Z) - Towards Omni-RAG: Comprehensive Retrieval-Augmented Generation for Large Language Models in Medical Applications [21.534794098692842]
Large language models hold promise for addressing medical challenges, such as medical diagnosis reasoning, research knowledge acquisition, clinical decision-making, and consumer health inquiry support.<n>We address this challenge by framing it as a source planning problem, which is to formulate context-appropriate queries tailored to the attributes of diverse sources.<n>Existing approaches either overlook source planning or fail to achieve it effectively due to misalignment between the model's expectation of the sources and their actual content.
arXiv Detail & Related papers (2025-01-05T07:03:14Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs)
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.<n>Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.<n>We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z) - REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records
Analysis via Large Language Models [19.62552013839689]
Existing models often lack the medical context relevent to clinical tasks, prompting the incorporation of external knowledge.
We propose REALM, a Retrieval-Augmented Generation (RAG) driven framework to enhance multimodal EHR representations.
Our experiments on MIMIC-III mortality and readmission tasks showcase the superior performance of our REALM framework over baselines.
arXiv Detail & Related papers (2024-02-10T18:27:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.