RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- URL: http://arxiv.org/abs/2407.05131v2
- Date: Thu, 17 Oct 2024 01:55:28 GMT
- Title: RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- Authors: Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, Huaxiu Yao
- Abstract summary: Current Medical Large Vision Language Models (Med-LVLMs) frequently encounter factual issues.
RAG, which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges.
We propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the calibrated selection of the number of retrieved contexts.
Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model.
- Score: 35.60385437194243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factual issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. First, limited retrieved contexts might not cover all necessary information, while excessive retrieval can introduce irrelevant and inaccurate references, interfering with the model's generation. Second, in cases where the model originally responds correctly, applying RAG can lead to an over-reliance on retrieved contexts, resulting in incorrect answers. To address these issues, we propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the calibrated selection of the number of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model, balancing its dependence on inherent knowledge and retrieved contexts for generation. We demonstrate the effectiveness of RULE on medical VQA and report generation tasks across three datasets, achieving an average improvement of 47.4% in factual accuracy. We publicly release our benchmark and code at https://github.com/richard-peng-xia/RULE.
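The two components described above lend themselves to a short illustration. The Python sketch below is a loose, hypothetical rendering of both ideas, not the released implementation (see the linked repository for that): all function names, field names, and the plug-in risk estimate are assumptions made purely for illustration.

```python
# Hypothetical sketch of RULE's two components; names and data layout are
# illustrative assumptions, not the official code from the linked repository.

from dataclasses import dataclass


@dataclass
class CalibrationExample:
    """One medical VQA example answered with every candidate context count."""
    correct_at_k: dict[int, bool]  # k (number of retrieved contexts) -> answer correct?


def select_num_contexts(calibration: list[CalibrationExample],
                        candidate_ks: list[int],
                        risk_budget: float = 0.2) -> int:
    """Pick the number of retrieved contexts whose empirical factual-error
    rate on a calibration set stays within the risk budget.

    Stands in for RULE's calibrated selection strategy; the paper's actual
    statistical guarantee is more involved than this plug-in estimate.
    """
    feasible = []
    for k in candidate_ks:
        errors = sum(1 for ex in calibration if not ex.correct_at_k[k])
        if errors / len(calibration) <= risk_budget:
            feasible.append(k)
    # Prefer the largest feasible k to keep as much retrieved evidence as allowed.
    return max(feasible) if feasible else min(candidate_ks)


def build_preference_pairs(samples: list[dict]) -> list[dict]:
    """Turn over-reliance failures into DPO-style preference pairs.

    Each sample is assumed to carry the answer produced without retrieval
    ('answer_no_rag'), the answer produced with retrieval ('answer_rag'),
    and the ground truth ('gold'). When retrieval flipped a correct answer
    to a wrong one, the retrieval-free answer becomes the preferred response.
    """
    pairs = []
    for s in samples:
        if s["answer_no_rag"] == s["gold"] and s["answer_rag"] != s["gold"]:
            pairs.append({
                "prompt": s["question"],
                "chosen": s["answer_no_rag"],
                "rejected": s["answer_rag"],
            })
    return pairs
```

The preference pairs constructed this way target exactly the failure mode the abstract describes: cases where retrieval flipped an originally correct answer.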
Related papers
- Enhancing Health Information Retrieval with RAG by Prioritizing Topical Relevance and Factual Accuracy [0.7673339435080445]
This paper introduces a solution driven by Retrieval-Augmented Generation (RAG) to enhance the retrieval of health-related documents grounded in scientific evidence.
In particular, we propose a three-stage model: in the first stage, the user's query is employed to retrieve topically relevant passages with associated references from a knowledge base constituted by scientific literature.
In the second stage, these passages, alongside the initial query, are processed by LLMs to generate a contextually relevant rich text (GenText).
In the last stage, the candidate documents are evaluated and ranked in terms of both topical relevance and factual accuracy.
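A rough sketch of this three-stage pipeline, assuming duck-typed retriever and LLM objects and a toy token-overlap scorer in place of the paper's relevance and factuality measures:

```python
# Illustrative outline of the three-stage pipeline summarized above. The
# retriever and LLM are duck-typed placeholders and the overlap scorer is a
# crude stand-in; none of this is the authors' implementation.

def _overlap(a: str, b: str) -> float:
    """Token-level Jaccard overlap, used here as a toy relevance/agreement score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def rank_health_documents(query: str, documents: list[str], retriever, llm) -> list[str]:
    # Stage 1: retrieve topically relevant passages (with references)
    # from a scientific-literature knowledge base.
    passages = retriever.search(query, top_k=10)

    # Stage 2: let the LLM fuse the query and passages into a rich
    # reference text ("GenText" in the paper).
    gen_text = llm.generate(
        "Summarize the evidence relevant to: " + query + "\n\n" + "\n".join(passages)
    )

    # Stage 3: rank candidate documents by combining topical relevance to the
    # query with agreement against GenText as a factuality reference.
    def score(doc: str) -> float:
        return 0.5 * _overlap(doc, query) + 0.5 * _overlap(doc, gen_text)

    return sorted(documents, key=score, reverse=True)
```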
arXiv Detail & Related papers (2025-02-07T05:19:13Z)
- HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation [89.3260120072177]
We propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for Radiology report generation.
Our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression.
Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models.
arXiv Detail & Related papers (2024-12-15T06:04:16Z)
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
- Rationale-Guided Retrieval Augmented Generation for Medical Question Answering [18.8818391508042]
Large language models (LLMs) hold significant potential for applications in biomedicine.
RAG$^2$ is a new framework for enhancing the reliability of RAG in biomedical contexts.
arXiv Detail & Related papers (2024-11-01T01:40:23Z)
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.
Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.
We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z)
- Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
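The summary does not name the benchmarked UE methods; as one concrete example of the sampling-based family commonly used in this setting, a minimal self-consistency estimate might look as follows (the `generate` callable and the sample count are assumptions, not the paper's setup):

```python
# One common sampling-based uncertainty estimate (answer agreement across
# stochastic samples); the methods benchmarked in the paper may differ.

from collections import Counter


def self_consistency_confidence(generate, question: str, n_samples: int = 10) -> tuple[str, float]:
    """Sample several answers and use the majority-vote frequency as a
    confidence score. `generate` is any callable returning one sampled answer.
    """
    answers = [generate(question) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples
```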
arXiv Detail & Related papers (2024-07-11T16:51:33Z)
- Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study [4.769236554995528]
We propose a retrieval-augmented generation architecture for medical question answering on emerging issues associated with health-related topics.
Our framework generates individual summaries followed by an aggregated summary to answer medical queries from large amounts of user-generated social media data.
Our framework achieves comparable median scores in terms of relevance, length, hallucination, coverage, and coherence when evaluated using GPT-4 and Nous-Hermes-2-7B-DPO.
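A minimal sketch of the summarize-then-aggregate idea, assuming a placeholder `summarize` LLM call and an arbitrary chunk size:

```python
# Rough sketch of the two-layer summarization described above; `summarize`
# is a placeholder LLM call, and the chunking details are assumptions.

def two_layer_answer(query: str, posts: list[str], summarize, chunk_size: int = 20) -> str:
    """Layer 1: summarize retrieved social-media posts in chunks.
    Layer 2: aggregate the chunk summaries into one answer to the query."""
    chunk_summaries = []
    for i in range(0, len(posts), chunk_size):
        chunk = "\n".join(posts[i:i + chunk_size])
        chunk_summaries.append(
            summarize(f"Summarize what these posts say about '{query}':\n{chunk}")
        )

    return summarize(
        f"Using the summaries below, answer the question: {query}\n\n" + "\n".join(chunk_summaries)
    )
```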
arXiv Detail & Related papers (2024-05-29T20:56:52Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now in widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)