Benchmarking Retrieval-Augmented Generation for Medicine
- URL: http://arxiv.org/abs/2402.13178v2
- Date: Fri, 23 Feb 2024 16:46:58 GMT
- Authors: Guangzhi Xiong, Qiao Jin, Zhiyong Lu, Aidong Zhang
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) have achieved state-of-the-art performance
on a wide range of medical question answering (QA) tasks, they still face
challenges with hallucinations and outdated knowledge. Retrieval-augmented
generation (RAG) is a promising solution and has been widely adopted. However,
a RAG system can involve multiple flexible components, and there is a lack of
best practices regarding the optimal RAG setting for various medical purposes.
To systematically evaluate such systems, we propose the Medical Information
Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind
benchmark including 7,663 questions from five medical QA datasets. Using
MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt
tokens on 41 combinations of different corpora, retrievers, and backbone LLMs
through the MedRAG toolkit introduced in this work. Overall, MedRAG improves
the accuracy of six different LLMs by up to 18% over chain-of-thought
prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our
results show that the combination of various medical corpora and retrievers
achieves the best performance. In addition, we discovered a log-linear scaling
property and the "lost-in-the-middle" effects in medical RAG. We believe our
comprehensive evaluations can serve as practical guidelines for implementing
RAG systems for medicine.
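To make the pipeline concrete, here is a minimal, hedged sketch of a MedRAG-style retrieve-then-answer loop for multiple-choice medical QA. It is not the actual API of the MedRAG toolkit introduced in the paper: the lexical scorer below is a toy stand-in for retrievers such as BM25 or MedCPT, and `ask_llm`, `score`, `retrieve`, and `build_prompt` are illustrative names.

```python
# Minimal sketch of a retrieval-augmented medical QA pipeline in the spirit
# of MedRAG. The retriever is a toy lexical scorer; a real system would use
# BM25, a dense retriever, or a fusion of several retrievers and corpora.

from collections import Counter
import math

def score(query: str, doc: str) -> float:
    """Toy lexical relevance: shared-term frequency, length-normalized."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    overlap = sum(min(q[t], d[t]) for t in q)
    return overlap / math.sqrt(len(doc.split()) + 1)

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the top-k corpus snippets by the toy score."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(question: str, options: dict[str, str], snippets: list[str]) -> str:
    """Prepend retrieved snippets to a multiple-choice question."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    opts = "\n".join(f"{key}. {text}" for key, text in options.items())
    return (f"Relevant documents:\n{context}\n\n"
            f"Question: {question}\n{opts}\n"
            "Answer with the letter of the best option.")

def ask_llm(prompt: str) -> str:
    """Placeholder for any backbone model (GPT-3.5, GPT-4, Mixtral, ...)."""
    raise NotImplementedError("plug in a backbone LLM call here")
```

In MIRAGE's setting, `options` maps answer letters to choices. Per the abstract, combining multiple medical corpora and retrievers gave the best accuracy, so a production system would fuse ranked lists from several retrievers over several corpora rather than rely on a single scorer like the one above.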
Related papers
- Rationale-Guided Retrieval Augmented Generation for Medical Question Answering (arXiv, 2024-11-01)
Large language models (LLMs) hold significant potential for applications in biomedicine.
RAG² is a new framework for enhancing the reliability of RAG in biomedical contexts.
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models (arXiv, 2024-10-16)
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.
Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.
We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
- Towards Evaluating and Building Versatile Large Language Models for Medicine (arXiv, 2024-08-22)
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
The companion instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
- RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation (arXiv, 2024-08-15)
We propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules.
RAGChecker has significantly better correlations with human judgments than other evaluation metrics.
The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems (a simplified sketch of the claim-level idea behind such metrics appears after this list).
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI (arXiv, 2024-08-06)
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
- Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions (arXiv, 2024-08-01)
i-MedRAG is a system that iteratively asks follow-up queries based on previous information-seeking attempts.
Our zero-shot i-MedRAG outperforms all existing prompt engineering and fine-tuning methods on GPT-3.5.
i-MedRAG can flexibly ask follow-up queries to form reasoning chains, providing an in-depth analysis of medical questions (a minimal sketch of this iterative loop appears after this list).
- SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation (arXiv, 2024-06-17)
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG).
Existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries.
We propose Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm.
- Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models (arXiv, 2024-02-17)
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
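As referenced in the RAGChecker entry above, here is a hedged, simplified sketch of the claim-level precision/recall idea that fine-grained RAG diagnostics build on: decompose the model answer and the gold answer into atomic claims, then score overlap in both directions. RAGChecker's actual framework uses LLM-based claim extraction and entailment checking; `extract_claims` and `entails` below are naive stand-ins so the metric arithmetic is runnable end to end.

```python
# Simplified sketch of claim-level checking in the spirit of RAGChecker.
# Real claim extraction/entailment is LLM-based; these are naive placeholders.

def extract_claims(text: str) -> set[str]:
    """Naive claim extractor: one 'claim' per sentence, normalized."""
    return {s.strip().lower() for s in text.split(".") if s.strip()}

def entails(claim: str, claims: set[str]) -> bool:
    """Placeholder entailment: exact match (a real system uses an NLI/LLM judge)."""
    return claim in claims

def claim_precision_recall(answer: str, gold: str) -> tuple[float, float]:
    """Precision: answer claims supported by gold; recall: gold claims covered."""
    a, g = extract_claims(answer), extract_claims(gold)
    if not a or not g:
        return 0.0, 0.0
    precision = sum(entails(c, g) for c in a) / len(a)
    recall = sum(entails(c, a) for c in g) / len(g)
    return precision, recall

p, r = claim_precision_recall(
    "Metformin lowers hepatic glucose production. Metformin is first-line for T2DM.",
    "Metformin is first-line for T2DM. Metformin lowers hepatic glucose production. "
    "Metformin rarely causes lactic acidosis.",
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.67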
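And as referenced in the i-MedRAG entry, here is a minimal sketch of iterative retrieval with follow-up queries. It reuses the `retrieve` and `ask_llm` placeholders from the first sketch; `iterative_rag` and its parameters are illustrative names, not the paper's actual API.

```python
# Hedged sketch of iterative follow-up retrieval in the spirit of i-MedRAG:
# each round, the model states what it still needs to know, the retriever
# fetches evidence for that follow-up query, and the final answer is
# conditioned on everything gathered.

def iterative_rag(question: str, corpus: list[str],
                  ask_llm, retrieve, rounds: int = 3, k: int = 3) -> str:
    notes: list[str] = []
    for _ in range(rounds):
        # Ask the model for one follow-up query given the evidence so far.
        followup = ask_llm(
            "Question: " + question + "\n"
            "Known so far:\n" + "\n".join(notes) + "\n"
            "Write ONE follow-up search query that would help answer the question."
        )
        # Accumulate newly retrieved snippets, skipping duplicates.
        for snippet in retrieve(followup, corpus, k):
            if snippet not in notes:
                notes.append(snippet)
    # Final answer conditioned on the accumulated evidence chain.
    return ask_llm(
        "Question: " + question + "\n"
        "Evidence:\n" + "\n".join(notes) + "\n"
        "Answer the question using the evidence above."
    )
```

The loop trades extra prompt tokens for coverage: later queries can target gaps that a single-shot retrieval over the original question would miss, which is the intuition behind i-MedRAG's gains over one-pass RAG.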