Related papers: Med-R$^2$: Crafting Trustworthy LLM Physicians via Retrieval and Reasoning of Evidence-Based Medicine

URL: http://arxiv.org/abs/2501.11885v5
Date: Thu, 09 Oct 2025 07:42:57 GMT
Title: Med-R$^2$: Crafting Trustworthy LLM Physicians via Retrieval and Reasoning of Evidence-Based Medicine
Authors: Keer Lu, Zheng Liang, Da Pan, Shusen Zhang, Guosheng Dong, Zhonghai Wu, Huang Leng, Bin Cui, Wentao Zhang
Abstract summary: Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. We introduce Med-R$^2$, a novel framework that adheres to the Evidence-Based Medicine (EBM) process. Our experiments indicate that Med-R$^2$ achieves a 13.27% improvement over vanilla RAG methods and even a 4.55% enhancement compared to fine-tuning strategies.
Score: 25.38817351839917
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. Despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R$^2$, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R$^2$ achieves a 13.27% improvement over vanilla RAG methods and even a 4.55% enhancement compared to fine-tuning strategies, without incurring additional training costs. Furthermore, we find that our LLaMA3.1-70B + Med-R$^2$ surpasses frontier models, including GPT-4o, Claude3.5-Sonnet and DeepSeek-V3, by 1.05%, 6.14% and 1.91%, respectively. Med-R$^2$ effectively enhances the capabilities of LLMs in the medical domain.

Related papers

Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework [0.0]
We present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. Our fine-tuned LLaMA2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline.
arXiv Detail & Related papers (2025-12-05T16:38:47Z)
SparseDoctor: Towards Efficient Chat Doctor with Mixture of Experts Enhanced Large Language Models [10.761477571508253]
Large language models (LLMs) have achieved great success in medical question answering and clinical decision-making. Traditional fine-tuning strategies for LLMs require updating billions of parameters, substantially increasing the training cost. We craft a novel sparse medical LLM named SparseDoctor, armed with a contrastive-learning-enhanced LoRA-MoE architecture.
arXiv Detail & Related papers (2025-09-15T11:25:14Z)
Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning [25.67288561609553]
We introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning. In this framework, we first develop the model's ability to perform logical reasoning over medical problems. We then adaptively optimize the retrieval capability to better align with the characteristics of the knowledge corpus and external information utilization.
arXiv Detail & Related papers (2025-07-31T13:31:01Z)
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations [5.265452667976959]
This survey systematically summarizes how to train medical LLMs based on open-source general LLMs. It covers (a) how to acquire training corpus and construct customized medical training sets, (b) how to choose an appropriate training paradigm, (c) how to choose a suitable evaluation benchmark, and (d) existing challenges and promising research directions.
arXiv Detail & Related papers (2024-06-14T02:42:20Z)
JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability [8.476124605775976]
Large Language Models (LLMs) have demonstrated remarkable potential in medical knowledge acquisition and question answering. LLMs can potentially hallucinate and yield factually incorrect outcomes, even with domain-specific pretraining. To address hallucinations, we introduce JMLR, which jointly trains the LLM and information retrieval during the fine-tuning phase.
arXiv Detail & Related papers (2024-02-27T21:01:41Z)
Large Language Model Distilling Medication Recommendation Model [61.89754499292561]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs). Our research aims to transform existing medication recommendation methodologies using LLMs. We have also developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
Large language models (LLMs) can potentially democratize access to medical knowledge. We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
arXiv Detail & Related papers (2023-11-27T18:49:43Z)
A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language. This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z)
MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering [45.84961106102445]
Large Language Models (LLMs) often perform poorly on domain-specific tasks such as medical question answering (QA). We propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the LLM's query prompt. Our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%.
arXiv Detail & Related papers (2023-09-27T21:26:03Z)
Aligning Large Language Models for Clinical Tasks [0.0]
Large Language Models (LLMs) have demonstrated remarkable adaptability, showcasing their capacity to excel in tasks for which they were not explicitly trained. We propose an alignment strategy for medical question-answering, known as 'expand-guess-refine'. A preliminary analysis of this method demonstrated outstanding performance, achieving a score of 70.63% on a subset of questions sourced from the USMLE dataset.
arXiv Detail & Related papers (2023-09-06T10:20:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.