RxSafeBench: Identifying Medication Safety Issues of Large Language Models in Simulated Consultation
- URL: http://arxiv.org/abs/2511.04328v1
- Date: Thu, 06 Nov 2025 12:56:34 GMT
- Title: RxSafeBench: Identifying Medication Safety Issues of Large Language Models in Simulated Consultation
- Authors: Jiahao Zhao, Luxin Xu, Minghuan Tan, Lichao Zhang, Ahmadreza Argha, Hamid Alinejad-Rokny, Min Yang,
- Abstract summary: Large Language Models (LLMs) have achieved remarkable progress in diverse healthcare tasks.<n>However, research on their medication safety remains limited due to the lack of real world datasets.<n>We propose a framework that simulates and evaluates clinical consultations to systematically assess the medication safety capabilities of LLMs.
- Score: 19.41567007880886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerous medical systems powered by Large Language Models (LLMs) have achieved remarkable progress in diverse healthcare tasks. However, research on their medication safety remains limited due to the lack of real world datasets, constrained by privacy and accessibility issues. Moreover, evaluation of LLMs in realistic clinical consultation settings, particularly regarding medication safety, is still underexplored. To address these gaps, we propose a framework that simulates and evaluates clinical consultations to systematically assess the medication safety capabilities of LLMs. Within this framework, we generate inquiry diagnosis dialogues with embedded medication risks and construct a dedicated medication safety database, RxRisk DB, containing 6,725 contraindications, 28,781 drug interactions, and 14,906 indication-drug pairs. A two-stage filtering strategy ensures clinical realism and professional quality, resulting in the benchmark RxSafeBench with 2,443 high-quality consultation scenarios. We evaluate leading open-source and proprietary LLMs using structured multiple choice questions that test their ability to recommend safe medications under simulated patient contexts. Results show that current LLMs struggle to integrate contraindication and interaction knowledge, especially when risks are implied rather than explicit. Our findings highlight key challenges in ensuring medication safety in LLM-based systems and provide insights into improving reliability through better prompting and task-specific tuning. RxSafeBench offers the first comprehensive benchmark for evaluating medication safety in LLMs, advancing safer and more trustworthy AI-driven clinical decision support.
Related papers
- SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond [134.43113804188195]
We introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts.<n>SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement.
arXiv Detail & Related papers (2026-03-02T08:16:04Z) - MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs [2.2090506971647144]
Medical Prompt Injection Benchmark (MPIB) is a dataset-and-benchmark suite for evaluating clinical safety under both direct prompt injection and indirect, RAG-mediated injection.<n>MPIB emphasizes outcome-level risk via the Clinical Harm Event Rate (CHER), which measures high-severity clinical harm events.<n>We release MPIB with evaluation code, adversarial baselines, and comprehensive documentation to support reproducible and systematic research on clinical prompt injection.
arXiv Detail & Related papers (2026-02-06T00:03:09Z) - A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care [5.167350493769989]
This is the first evaluation of an LLM-based medication safety review system on real NHS primary care data.<n>We strategically sampled patients to capture a broad range of clinical complexity and medication safety risk.<n>Our primary LLM system showed strong performance in recognising when a clinical issue is present.
arXiv Detail & Related papers (2025-12-24T11:58:49Z) - MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs [7.2159153945746795]
Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap.<n>We introduce MediEval, a benchmark that links MIMIC-IV electronic health records to a unified knowledge base built from UMLS and other biomedical vocabularies.<n> MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework.
arXiv Detail & Related papers (2025-12-23T22:52:24Z) - Exploring Membership Inference Vulnerabilities in Clinical Large Language Models [42.52690697965999]
We present an exploratory empirical study on membership inference vulnerabilities in clinical large language models (LLMs)<n>Using a state-of-the-art clinical question-answering model, Llemr, we evaluate both canonical loss-based attacks and a domain-motivated paraphrasing-based perturbation strategy.<n>Results motivate continued development of context-aware, domain-specific privacy evaluations and defenses.
arXiv Detail & Related papers (2025-10-21T14:27:48Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs)<n>Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training.<n>Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains [15.73821689524201]
Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation.<n>We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus.<n>Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios.
arXiv Detail & Related papers (2025-07-31T12:10:00Z) - Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation [9.84660526673816]
This study investigated the feasibility and value of using a Large Language Model (LLM)-based multi-agent system for safer therapy recommendations.<n>We designed a single agent and a MAS framework simulating multidisciplinary team (MDT) decision-making.<n>We compared MAS performance with single-agent approaches and real-world benchmarks.
arXiv Detail & Related papers (2025-07-15T02:01:38Z) - Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings [48.096652370210016]
We introduce a safety evaluation protocol tailored to the medical domain in both patient user and clinician user perspectives.<n>This is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming taking three different points of view.
arXiv Detail & Related papers (2025-07-09T19:38:58Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.<n>Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.<n>We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - Can LLMs Support Medical Knowledge Imputation? An Evaluation-Based Perspective [1.4913052010438639]
We have explored the use of Large Language Models (LLMs) for imputing missing treatment relationships.<n>LLMs offer promising capabilities in knowledge augmentation, but their application in medical knowledge imputation presents significant risks.<n>Our findings highlight critical limitations, including inconsistencies with established clinical guidelines and potential risks to patient safety.
arXiv Detail & Related papers (2025-03-29T02:52:17Z) - A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare [8.378348088931578]
The application of large language models (LLMs) in healthcare holds significant promise for enhancing clinical decision-making, medical research, and patient care.<n>Their integration into real-world clinical settings raises critical concerns around trustworthiness, particularly around dimensions of truthfulness, privacy, safety, robustness, fairness, and explainability.
arXiv Detail & Related papers (2025-02-21T18:43:06Z) - SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs [78.99703366417661]
Large language models (LLMs) increasingly assist in tasks ranging from procedural guidance to autonomous experiment orchestration.<n>Such overreliance is particularly dangerous in high-stakes laboratory settings, where failures in hazard identification or risk assessment can result in severe accidents.<n>We propose the Laboratory Safety Benchmark (LabSafety Bench) to evaluate models on their ability to identify potential hazards, assess risks, and predict the consequences of unsafe actions in lab environments.
arXiv Detail & Related papers (2024-10-18T05:21:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.