First, do NOHARM: towards clinically safe large language models
- URL: http://arxiv.org/abs/2512.01241v1
- Date: Mon, 01 Dec 2025 03:33:16 GMT
- Title: First, do NOHARM: towards clinically safe large language models
- Authors: David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Ernest Y. Lee, Allen Shih, Vartan Pahalyants, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Adam Rodman, Jonathan H. Chen, Ethan Goh
- Abstract summary: We present NOHARM, a benchmark using 100 real primary-care-to-specialist consultation cases to measure harm frequency and severity. Across 31 large language models (LLMs), severe harm occurs in up to 22.2% of cases, with harms of omission accounting for 76.6%. The best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%), and a diverse multi-agent approach reduces harm compared to solo models.
- Score: 4.4072363018342005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark using 100 real primary-care-to-specialist consultation cases to measure harm frequency and severity from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 31 LLMs, severe harm occurs in up to 22.2% (95% CI 21.6-22.8%) of cases, with harms of omission accounting for 76.6% (95% CI 76.4-76.8%) of errors. Safety performance is only moderately correlated (r = 0.61-0.64) with existing AI and medical knowledge benchmarks. The best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%), and a diverse multi-agent approach reduces harm compared to solo models (mean difference 8.0%, 95% CI 4.0-12.1%). Therefore, despite strong performance on existing evaluations, widely used AI models can produce severely harmful medical advice at nontrivial rates, underscoring clinical safety as a distinct performance dimension necessitating explicit measurement.
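As a rough sketch of the kind of headline statistic NOHARM reports (e.g., "severe harm in up to 22.2% of cases, 95% CI 21.6-22.8%"), the following computes a per-case severe-harm rate with a nonparametric bootstrap confidence interval. The flag vector, sample sizes, and function names are hypothetical illustrations, not the paper's actual annotation or scoring pipeline.

```python
import random

def severe_harm_rate(case_flags):
    """Fraction of cases where at least one recommended management
    option was annotated as severely harmful (1 = yes, 0 = no)."""
    return sum(case_flags) / len(case_flags)

def bootstrap_ci(case_flags, n_boot=10_000, alpha=0.05, seed=0):
    """Nonparametric bootstrap CI over cases (hypothetical setup)."""
    rng = random.Random(seed)
    n = len(case_flags)
    stats = sorted(
        severe_harm_rate([case_flags[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy example: 100 cases, 22 with severe harm observed.
flags = [1] * 22 + [0] * 78
print(severe_harm_rate(flags), bootstrap_ci(flags))
```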
Related papers
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
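The abstract describes an LLM-as-a-judge framework that scores answers against expert rubrics. Below is a minimal sketch of that pattern, assuming a hypothetical `call_llm` completion function and an invented three-item rubric; PanCanBench's real rubrics and judge prompts are not specified in the abstract.

```python
import json

RUBRIC = {  # hypothetical expert rubric for one patient question
    "criteria": [
        "Recommends cross-sectional imaging (e.g., CT pancreas protocol)",
        "Mentions referral to a pancreatic surgeon or tumor board",
        "Avoids stating a definitive diagnosis without tissue confirmation",
    ]
}

JUDGE_PROMPT = """You are grading a model's answer against a rubric.
For each criterion, output 1 if satisfied, else 0.
Rubric: {rubric}
Answer: {answer}
Respond with a JSON list of 0/1 integers only."""

def judge(answer: str, call_llm) -> float:
    """Score one answer as the fraction of rubric criteria satisfied.
    `call_llm` is a hypothetical text-in/text-out completion function."""
    prompt = JUDGE_PROMPT.format(rubric=json.dumps(RUBRIC), answer=answer)
    scores = json.loads(call_llm(prompt))
    return sum(scores) / len(RUBRIC["criteria"])
```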
arXiv Detail & Related papers (2026-03-02T00:50:39Z)
- LiveClin: A Live Clinical Benchmark without Leakage [50.45415584327275]
LiveClin is a live benchmark designed to approximate real-world clinical practice. We transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%.
arXiv Detail & Related papers (2026-02-18T03:59:46Z)
- Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems [19.880569341968023]
Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. We propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics.
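A plausible shape for such a pipeline is sketched below: retrieve supporting guideline text, then have an LLM draft instance-specific criteria. The `retrieve` and `call_llm` helpers are hypothetical stand-ins; the paper's actual agent roles and prompts are not given in the abstract.

```python
def build_rubric(dialogue: str, retrieve, call_llm, k: int = 5) -> list[str]:
    """Sketch of retrieval-augmented rubric generation.
    `retrieve(query, k)` returns guideline snippets and `call_llm` is a
    text-in/text-out completion function; both are hypothetical."""
    evidence = retrieve(dialogue, k)
    prompt = (
        "Given this medical dialogue and supporting guideline excerpts, "
        "write one checkable evaluation criterion per line.\n\n"
        f"Dialogue:\n{dialogue}\n\nEvidence:\n" + "\n".join(evidence)
    )
    draft = call_llm(prompt)
    # A second critic agent could refine the draft; one pass shown here.
    return [line.strip("- ").strip() for line in draft.splitlines() if line.strip()]
```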
arXiv Detail & Related papers (2026-01-21T16:40:41Z)
- MLB: A Scenario-Driven Benchmark for Evaluating Large Language Models in Clinical Applications [27.73095565539546]
We introduce MLB, a Medical LLM Benchmark that evaluates Large Language Models (LLMs) on both foundational knowledge and scenario-based reasoning. MLB is structured around five core dimensions: Medical Knowledge (MedKQA), Safety and Ethics (MedSE), Medical Record Understanding (MedRU), Smart Services (SmartServ), and Smart Healthcare (SmartCare). Its design features a rigorous curation pipeline involving 300 licensed physicians. In addition, we provide a scalable evaluation methodology centered on a specialized judge model trained via Supervised Fine-Tuning (SFT) on expert annotations.
arXiv Detail & Related papers (2026-01-08T02:41:42Z)
- A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care [5.167350493769989]
This is the first evaluation of an LLM-based medication safety review system on real NHS primary care data. We strategically sampled patients to capture a broad range of clinical complexity and medication safety risk. Our primary LLM system showed strong performance in recognising when a clinical issue is present.
arXiv Detail & Related papers (2025-12-24T11:58:49Z)
- EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy: models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
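One simple way to operationalize sycophancy is a paired-prompt test: ask each question plainly, then again with an incorrect user assertion injected, and count answer flips. The sketch below assumes hypothetical `items` dicts and a `call_llm` function; EchoBench's actual protocol (including its vision inputs) may differ.

```python
def sycophancy_rate(items, call_llm) -> float:
    """Fraction of items where an injected incorrect user claim flips a
    previously correct answer. Each item holds a question, the gold
    answer, and a misleading user assertion (all hypothetical fields)."""
    flipped = 0
    for it in items:
        base = call_llm(it["question"])
        nudged = call_llm(
            it["question"]
            + "\nThe user insists the answer is " + it["wrong_claim"] + "."
        )
        # Crude substring check; a judge model could grade this instead.
        if it["gold"] in base and it["gold"] not in nudged:
            flipped += 1
    return flipped / len(items)
```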
arXiv Detail & Related papers (2025-09-24T14:09:55Z)
- MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine [69.08855631283829]
We introduce MedOmni-45°, a benchmark designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal.
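Reading "the diagonal" as the line where safety equals performance on a 45-degree plot (an assumption on my part; the abstract does not define it), a model "surpasses" it when its safety score exceeds its performance score, as in this toy check with made-up numbers:

```python
# Hypothetical (performance, safety) scores in [0, 1] for a few models.
models = {"model_a": (0.82, 0.61), "model_b": (0.74, 0.70), "model_c": (0.66, 0.58)}

for name, (performance, safety) in models.items():
    # On a 45-degree plot, a point is above the diagonal when safety
    # exceeds performance; the paper reports no model achieves this.
    status = "above" if safety > performance else "on/below"
    print(f"{name}: performance={performance:.2f} safety={safety:.2f} -> {status} diagonal")
```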
arXiv Detail & Related papers (2025-08-22T08:38:16Z)
- A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains [15.73821689524201]
Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios.
arXiv Detail & Related papers (2025-07-31T12:10:00Z)
- Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify and evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
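A minimal version of such an adversarial loop might look like the following: mutate cases with an attacker model, keep mutants that elicit unsafe responses, and evolve them further. The `attacker`, `target`, and `is_unsafe` callables are hypothetical, not the paper's components.

```python
import random

def red_team(seed_cases, attacker, target, is_unsafe, rounds=50, seed=0):
    """Evolutionary red-teaming sketch. `attacker` and `target` are
    text-in/text-out LLM callables and `is_unsafe` is a safety judge;
    all three are hypothetical stand-ins."""
    rng = random.Random(seed)
    pool, hits = list(seed_cases), []
    for _ in range(rounds):
        parent = rng.choice(pool)
        mutant = attacker(
            "Rewrite this medical query to be more adversarial while "
            "staying clinically plausible:\n" + parent
        )
        if is_unsafe(target(mutant)):
            hits.append(mutant)   # record the unsafe-triggering case
            pool.append(mutant)   # evolve successful strategies further
    return hits
```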
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
- Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings [48.096652370210016]
We introduce a safety evaluation protocol tailored to the medical domain from both patient and clinician user perspectives. This is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming that takes three different points of view.
arXiv Detail & Related papers (2025-07-09T19:38:58Z)
- RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction [27.520717720270415]
We present the RiskAgent system to perform a broad range of medical risk predictions. RiskAgent covers over 387 risk scenarios across diverse complex diseases, e.g., cardiovascular disease and cancer. We also build MedRisk, the first benchmark specialized for risk prediction, including 12,352 questions spanning 154 diseases, 86 symptoms, 50 specialties, and 24 organ systems.
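For flavor, one of the many validated calculators a risk-prediction agent could invoke as a tool is the CHA2DS2-VASc stroke-risk score for atrial fibrillation. The score itself is standard; this encoding and its argument names are an illustrative sketch, not RiskAgent code.

```python
def cha2ds2_vasc(age, female, chf, htn, diabetes, stroke_tia, vascular) -> int:
    """CHA2DS2-VASc stroke-risk score for atrial fibrillation:
    CHF +1, hypertension +1, age >=75 +2 (65-74 +1), diabetes +1,
    prior stroke/TIA/thromboembolism +2, vascular disease +1, female +1."""
    score = 2 if age >= 75 else (1 if age >= 65 else 0)
    score += 1 if female else 0
    score += 1 if chf else 0
    score += 1 if htn else 0
    score += 1 if diabetes else 0
    score += 2 if stroke_tia else 0
    score += 1 if vascular else 0
    return score

print(cha2ds2_vasc(age=72, female=True, chf=False, htn=True,
                   diabetes=False, stroke_tia=False, vascular=False))  # -> 3
```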
arXiv Detail & Related papers (2025-03-05T18:46:51Z)
- Medical Hallucinations in Foundation Models and Their Impact on Healthcare [71.15392179084428]
Hallucinations in foundation models arise from autoregressive training objectives. Top-performing models exceeded 97% accuracy when augmented with chain-of-thought prompting.
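As a reminder of what chain-of-thought prompting means in practice, here is a minimal direct-vs-CoT prompt pair; the question and phrasing are illustrative only, not the paper's actual prompts.

```python
QUESTION = ("A patient on warfarin is started on trimethoprim-sulfamethoxazole. "
            "What happens to the INR, and what should be done?")

direct_prompt = f"{QUESTION}\nAnswer:"

# Chain-of-thought variant: ask the model to reason through intermediate
# steps before committing to an answer.
cot_prompt = (
    f"{QUESTION}\n"
    "Think step by step: first describe the drug interaction mechanism, "
    "then its effect on the INR, then the recommended monitoring.\nAnswer:"
)
```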
arXiv Detail & Related papers (2025-02-26T02:30:44Z)
- Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging [47.99192239793597]
We evaluated the effect of privacy-preserving training of AI models on accuracy and fairness compared with non-private training.
Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
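The abstract does not name the training mechanism, but privacy-preserving deep learning is typically implemented with DP-SGD (per-sample gradient clipping plus calibrated Gaussian noise). A minimal sketch using PyTorch with the Opacus library on toy stand-in data:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-ins for a medical-imaging classifier and dataset.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 2))
data = TensorDataset(torch.randn(256, 1, 64, 64), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Wrap training in DP-SGD: per-sample gradient clipping plus noise.
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model, optimizer=optimizer, data_loader=loader,
    noise_multiplier=1.1,   # noise scale; drives the privacy budget
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

loss_fn = nn.CrossEntropyLoss()
for x, y in loader:  # one epoch of DP training
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```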
arXiv Detail & Related papers (2023-02-03T09:49:13Z)