Related papers: Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024)

Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024)

URL: http://arxiv.org/abs/2309.02233v3
Date: Mon, 07 Oct 2024 17:21:45 GMT
Title: Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024)
Authors: Yubo Wang, Xueguang Ma, Wenhu Chen,
Abstract summary: We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT) LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules. We found that medical textbooks as a retrieval corpus is proven to be a more effective knowledge database than Wikipedia in the medical domain.
Score: 48.17095875619711
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large-scale language models (LLMs) like ChatGPT have demonstrated impressive abilities in generating responses based on human instructions. However, their use in the medical field can be challenging due to their lack of specific, in-depth knowledge. In this study, we present a system called LLMs Augmented with Medical Textbooks (LLM-AMT) designed to enhance the proficiency of LLMs in specialized domains. LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules. These modules include a Query Augmenter, a Hybrid Textbook Retriever, and a Knowledge Self-Refiner. Together, they incorporate authoritative medical knowledge. Additionally, an LLM Reader aids in contextual understanding. Our experimental results on three medical QA tasks demonstrate that LLMAMT significantly improves response quality, with accuracy gains ranging from 11.6% to 16.6%. Notably, with GPT-4-Turbo as the base model, LLM-AMT outperforms the specialized Med-PaLM 2 model pre-trained on a massive amount of medical corpus by 2-3%. We found that despite being 100x smaller in size, medical textbooks as a retrieval corpus is proven to be a more effective knowledge database than Wikipedia in the medical domain, boosting performance by 7.8%-13.7%.

Related papers

MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration [57.98393950821579]
We introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM)<n>Inspired by our empirical findings, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director.<n>This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases.
arXiv Detail & Related papers (2025-06-24T17:52:43Z)
MIRIAD: Augmenting LLMs with millions of medical query-response pairs [36.32674607022871]
We introduce MIRIAD, a large-scale, curated corpus of 5,821,948 medical QA pairs.<n>We show that MIRIAD improves accuracy up to 6.7% compared to unstructured RAG baselines.<n>We also introduce MIRIAD-Atlas, an interactive map of MIRIAD spanning 56 medical disciplines.
arXiv Detail & Related papers (2025-06-06T13:52:32Z)
Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
MedG-KRP: Medical Graph Knowledge Representation Probing [0.6496030410305753]
Large language models (LLMs) have recently emerged as powerful tools, finding many medical applications. We introduce a knowledge graph (KG)-based method to evaluate the biomedical reasoning abilities of LLMs. We test GPT-4, Llama3-70b, and PalmyraMed-70b, a specialized medical model.
arXiv Detail & Related papers (2024-12-14T22:23:20Z)
MEG: Medical Knowledge-Augmented Large Language Models for Question Answering [37.3562521243773]
We present MEG, a parameter-efficient approach for medical knowledge-augmented LLMs. We evaluate our method on four popular medical multiple-choice datasets.
arXiv Detail & Related papers (2024-11-06T12:57:58Z)
SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation [50.26966969163348]
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG) Existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries. We propose Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm.
arXiv Detail & Related papers (2024-06-17T06:48:31Z)
MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering [5.065947993017158]
Large Language Models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora. We examine the capability of LLMs to exhibit medical knowledge recall by constructing a novel dataset derived from systematic reviews.
arXiv Detail & Related papers (2024-06-09T16:33:28Z)
OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models [4.556924372105915]
Open-source (OS) models represent a key area of growth for medical LLMs. We present OpenMedLM, a prompting platform which delivers state-of-the-art (SOTA) performance for OS LLMs on medical benchmarks.
arXiv Detail & Related papers (2024-02-29T17:19:39Z)
MEDITRON-70B: Scaling Medical Pretraining for Large Language Models [91.25119823784705]
Large language models (LLMs) can potentially democratize access to medical knowledge. We release MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
arXiv Detail & Related papers (2023-11-27T18:49:43Z)
ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for Chinese medical domain. ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF. We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z)
A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language. This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z)
Integrating UMLS Knowledge into Large Language Models for Medical Question Answering [18.06960842747575]
Large language models (LLMs) have demonstrated powerful text generation capabilities, bringing unprecedented innovation to the healthcare field. We develop an augmented LLM framework based on the Unified Medical Language System (UMLS), aiming to better serve the healthcare community. We employ LLaMa2-13b-chat and ChatGPT-3.5 as our benchmark models, and conduct automatic evaluations using the ROUGE Score and BERTScore on 104 questions from the LiveQA test set.
arXiv Detail & Related papers (2023-10-04T12:50:26Z)
MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering [45.84961106102445]
Large Language Models (LLMs) often perform poorly on domain-specific tasks such as medical question answering (QA) We propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the LLM's query prompt. Our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%.
arXiv Detail & Related papers (2023-09-27T21:26:03Z)
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency. evaluating LLMs on realistic text generation tasks for healthcare remains challenging. We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.