DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline
- URL: http://arxiv.org/abs/2512.14896v1
- Date: Tue, 16 Dec 2025 20:19:23 GMT
- Title: DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline
- Authors: Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Mohammad Hossein Rohban, Glolamali Aminian, Tahereh Javaheri,
- Abstract summary: We benchmarked eleven existing large language models (LLMs) with varying parameter sizes using a 141-question pharmacy dataset.<n>We developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context.<n>DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark.
- Score: 4.750574899254107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objectives: To evaluate large language model (LLM) performance on pharmacy licensure-style question-answering (QA) tasks and develop an external knowledge integration method to improve their accuracy. Methods: We benchmarked eleven existing LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset. We measured baseline accuracy for each model without modification. We then developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context. This pipeline operates externally to the models, requiring no changes to model architecture or parameters. Results: Baseline accuracy ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores. Models with fewer than 8 billion parameters scored below 50%. DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark. Conclusion: We demonstrate that external structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying the underlying models. This approach provides a practical pipeline for enhancing pharmacy-focused AI applications with evidence-based information.
Related papers
- Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines [1.9615061725959186]
This paper presents the development and evaluation of a Retrieval-Augmented Generation system for querying the United Kingdom's National Institute for Health and Care Excellence (NICE) clinical guidelines using Large Language Models (LLMs)<n>The system's retrieval architecture, composed of a hybrid embedding mechanism, was evaluated against a database of 10,195 text chunks derived from three hundred guidelines.<n>It demonstrates high performance, with a Mean Reciprocal Rank (MRR) of 0.814, a Recall of 81% at the first chunk and of 99.1% within the top ten retrieved chunks, when evaluated on 7901 queries.
arXiv Detail & Related papers (2025-10-03T12:57:13Z) - GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [194.64264251080454]
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters.<n>Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding tasks.<n>We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems.
arXiv Detail & Related papers (2025-08-08T17:21:06Z) - Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications [0.0]
The versatility of large language models (LLMs) has been explored across various sectors, but their application in healthcare poses challenges.<n>This study enhances the capability of LLMs to address contraindications effectively by implementing a Retrieval Augmented Generation (RAG) pipeline.
arXiv Detail & Related papers (2025-08-08T09:09:03Z) - DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery [54.79763887844838]
Large language models (LLMs) integrated with autonomous agents hold significant potential for advancing scientific discovery through automated reasoning and task execution.<n>We introduce DrugPilot, a LLM-based agent system with a parameterized reasoning architecture designed for end-to-end scientific in drug discovery.<n>DrugPilot significantly outperforms state-of-the-art agents such as ReAct and LoT, achieving task completion rates of 98.0%, 93.5%, and 64.0% for simple, multi-tool, and multi-turn scenarios, respectively.
arXiv Detail & Related papers (2025-05-20T05:18:15Z) - EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs.<n>We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z) - LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison [3.2627279988912194]
Large Language Models (LLMs) have revolutionized various domains, but their potential in pharmaceutical research remains largely unexplored.<n>This study thoroughly investigates LLMs' capabilities in predicting drug-drug interactions (DDIs)<n>We evaluated 18 different LLMs, including proprietary models (GPT-4, Claude, Gemini) and open-source variants (from 1.5B to 72B parameters)<n>Fine-tuned LLMs demonstrated superior performance, with Phi-3.5 2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of 0.919 on balanced datasets.
arXiv Detail & Related papers (2025-02-09T09:58:12Z) - Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models [0.0]
Sporo Health's AI Scribe is a proprietary model fine-tuned for medical scribing.
We analyzed de-identified patient transcripts from partner clinics, using clinician-provided SOAP notes as the ground truth.
Sporo outperformed all models, achieving the highest recall (73.3%), precision (78.6%), and F1 score (75.3%) with the lowest performance variance.
arXiv Detail & Related papers (2024-11-11T04:45:48Z) - Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z) - Development and Testing of Retrieval Augmented Generation in Large
Language Models -- A Case Study Report [2.523433459887027]
Retrieval Augmented Generation (RAG) emerges as a promising approach for customizing domain knowledge in Large Language Models (LLMs)
We developed an LLM-RAG model using 35 preoperative guidelines and tested it against human-generated responses.
The model generated answers within an average of 15-20 seconds, significantly faster than the 10 minutes typically required by humans.
arXiv Detail & Related papers (2024-01-29T06:49:53Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter
Encoders for Natural Language Understanding Systems [63.713297451300086]
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B.
Their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system.
arXiv Detail & Related papers (2022-06-15T20:44:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.