Related papers: Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering

Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering

URL: http://arxiv.org/abs/2411.09213v1
Date: Thu, 14 Nov 2024 06:19:18 GMT
Title: Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
Authors: Nghia Trung Ngo, Chien Van Nguyen, Franck Dernoncourt, Thien Huu Nguyen,
Abstract summary: Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
Score: 70.44269982045415
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those from medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets for testing LLMs' ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provides valuable insights and future directions for developing RAG systems in this critical medical domain.

Related papers

MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine [3.615835506868351]
We introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, covering various tasks in English and Chinese languages.<n>We also develop the MRAG-Toolkit, facilitating systematic exploration of different RAG components.
arXiv Detail & Related papers (2026-01-23T07:07:13Z)
Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models [15.530083855947987]
We propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR.<n>Med-RwR actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning.<n> Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models.
arXiv Detail & Related papers (2025-10-21T05:18:18Z)
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error.<n>We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z)
Lightweight Clinical Decision Support System using QLoRA-Fine-Tuned LLMs and Retrieval-Augmented Generation [0.0]
This research paper investigates the application of Large Language Models (LLMs) in healthcare.<n>We focus on enhancing medical decision support through Retrieval-Augmented Generation (RAG) integrated with hospital-specific data and fine-tuning using Quantized Low-Rank Adaptation (QLoRA)<n>We touch on the ethical considerations-patient privacy, data security, and the need for rigorous clinical validation-as well as the practical challenges of integrating such systems into real-world healthcare.
arXiv Detail & Related papers (2025-05-06T10:31:54Z)
Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models [18.6994780408699]
Large Language Models (LLMs) face significant challenges in medical question answering. We propose a novel approach incorporating similar case generation within a multi-agent medical question-answering system. Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data.
arXiv Detail & Related papers (2024-12-31T19:55:45Z)
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses. We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z)
Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs) We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
arXiv Detail & Related papers (2024-09-16T09:06:44Z)
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals. GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions [42.73799041840482]
i-MedRAG is a system that iteratively asks follow-up queries based on previous information-seeking attempts. Our zero-shot i-MedRAG outperforms all existing prompt engineering and fine-tuning methods on GPT-3.5. i-MedRAG can flexibly ask follow-up queries to form reasoning chains, providing an in-depth analysis of medical questions.
arXiv Detail & Related papers (2024-08-01T17:18:17Z)
Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models [10.04914417538886]
Large-scale language models (LLMs) have achieved remarkable success across various language tasks but suffer from hallucinations and temporal misalignment. We propose a new textitDistill-Retrieve-Read framework instead of the previous textitRetrieve-then-Read.
arXiv Detail & Related papers (2024-04-27T13:11:42Z)
Benchmarking Retrieval-Augmented Generation for Medicine [30.390132015614128]
Large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. We propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets.
arXiv Detail & Related papers (2024-02-20T17:44:06Z)
Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs) Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities. We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
MedLM: Exploring Language Models for Medical Question Answering Systems [2.84801080855027]
Large Language Models (LLMs) with their advanced generative capabilities have shown promise in various NLP tasks. This study aims to compare the performance of general and medical-specific distilled LMs for medical Q&A. The findings will provide valuable insights into the suitability of different LMs for specific applications in the medical domain.
arXiv Detail & Related papers (2024-01-21T03:37:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.