Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks
- URL: http://arxiv.org/abs/2512.01191v1
- Date: Mon, 01 Dec 2025 02:14:43 GMT
- Title: Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks
- Authors: Krithik Vishwanath, Mrigayu Ghosh, Anton Alyakin, Daniel Alexander Alber, Yindalon Aphinyanaphongs, Eric Karl Oermann
- Abstract summary: Generalist models consistently outperformed clinical tools. OpenEvidence and UpToDate Expert AI demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning.
- Score: 1.2773749417703923
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.
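As a rough, hypothetical sketch of how a mixed MedQA/HealthBench mini-benchmark of this kind could be assembled and scored (the file paths, item fields, `system_ask` client, and rubric grader below are placeholders, not the authors' actual pipeline):

```python
# Hypothetical sketch: assemble a 1,000-item mini-benchmark from two JSONL
# sources and score a system on each subset. Not the paper's real pipeline.
import json
import random

def load_items(medqa_path: str, healthbench_path: str, n_total: int = 1000):
    """Sample a fixed-size benchmark split evenly across the two sources."""
    with open(medqa_path) as f:
        medqa = [json.loads(line) for line in f]
    with open(healthbench_path) as f:
        healthbench = [json.loads(line) for line in f]
    rng = random.Random(0)  # fixed seed so the mini-benchmark is reproducible
    half = n_total // 2
    return ([("medqa", x) for x in rng.sample(medqa, half)]
            + [("healthbench", x) for x in rng.sample(healthbench, n_total - half)])

def evaluate(system_ask, items, grade_rubric):
    """system_ask: fn(prompt) -> str; grade_rubric: fn(item, answer) -> float in [0, 1]."""
    scores = {"medqa": [], "healthbench": []}
    for source, item in items:
        answer = system_ask(item["prompt"])
        if source == "medqa":
            # Multiple-choice knowledge items: exact match against the answer key.
            scores[source].append(float(answer.strip() == item["answer_key"]))
        else:
            # Open-ended items: rubric-based grading (e.g., an LLM judge).
            scores[source].append(grade_rubric(item, answer))
    return {source: sum(s) / len(s) for source, s in scores.items() if s}
```

Exact match for the multiple-choice MedQA items and rubric grading for the open-ended HealthBench items mirrors the split the abstract describes between medical-knowledge and clinician-alignment tasks.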
Related papers
- MedConsultBench: A Full-Cycle, Fine-Grained, Process-Aware Benchmark for Medical Consultation Agents [10.109613967215447]
We propose MedConsultBench, a comprehensive framework designed to evaluate the complete online consultation cycle. Our methodology introduces Atomic Information Units (AIUs) to track clinical information acquisition at a sub-turn level. By addressing the underspecification and ambiguity inherent in online consultations, the benchmark evaluates uncertainty-aware yet concise inquiry. A toy sketch of AIU-style tracking follows this entry.
arXiv Detail & Related papers (2026-01-19T02:18:10Z)
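A minimal sketch of what sub-turn information tracking could look like; the AIU schema and field names here are invented for illustration and are not MedConsultBench's actual format:

```python
# Hypothetical sketch of "Atomic Information Unit" coverage tracking.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AIU:
    """One atomic piece of clinical information the consultation should elicit."""
    name: str                    # e.g. "symptom_onset", "medication_allergies"
    acquired: bool = False
    turn: Optional[int] = None   # turn index at which it was elicited

@dataclass
class ConsultationTrace:
    required: list[AIU] = field(default_factory=list)

    def record(self, aiu_name: str, turn: int) -> None:
        """Mark an information unit as acquired at a given turn."""
        for aiu in self.required:
            if aiu.name == aiu_name and not aiu.acquired:
                aiu.acquired, aiu.turn = True, turn

    def coverage(self) -> float:
        """Fraction of required clinical information actually elicited."""
        if not self.required:
            return 0.0
        return sum(a.acquired for a in self.required) / len(self.required)
```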
- The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency [38.68458713626548]
Current benchmarks fail to capture the integrated, multimodal reasoning essential for real-world patient care. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway. Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning.
arXiv Detail & Related papers (2025-12-25T03:33:22Z)
- Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1,762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice. A toy interaction loop in this style is sketched below.
arXiv Detail & Related papers (2025-10-11T16:24:35Z)
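A toy version of a sequential, viva-style evaluation loop, assuming a hypothetical `ask` interface that returns either an information request or a final diagnosis; the vignette fields are invented:

```python
# Hypothetical sketch of an interactive clinical vignette: the model must
# request findings one at a time instead of seeing the full case up front.
def run_viva(ask, vignette: dict, max_turns: int = 10) -> dict:
    """ask: fn(history) -> {"action": "request" or "diagnose", "content": str}."""
    history = [("stem", vignette["presenting_complaint"])]
    for turn in range(max_turns):
        move = ask(history)
        if move["action"] == "diagnose":
            correct = move["content"].lower() == vignette["diagnosis"].lower()
            return {"turns": turn + 1, "correct": correct}
        # Reveal a finding only if the model explicitly asked for it.
        finding = vignette["findings"].get(move["content"], "not available")
        history.append((move["content"], finding))
    return {"turns": max_turns, "correct": False}
```

The point of the loop is that findings are revealed only on request, so the score reflects the model's line of questioning, not just its final answer.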
- OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries [2.2807344448218507]
We evaluate our agentic, RAG-based clinical support assistant, DR.INFO, using HealthBench. On the Hard subset of 1,000 challenging examples, DR.INFO achieves a HealthBench score of 0.51. In a separate 100-sample evaluation against similar agentic RAG assistants, it maintains a performance lead with a HealthBench score of 0.54.
arXiv Detail & Related papers (2025-08-29T09:51:41Z)
- Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. One simple way to score a judge against such annotations is sketched below.
arXiv Detail & Related papers (2025-08-29T08:58:39Z)
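One simple, hypothetical way to benchmark a judge is to check how often its scores agree with expert pairwise preferences. The data layout below is invented; Med-RewardBench's actual format and six-dimension rubric are richer:

```python
# Hypothetical sketch: agreement between an LLM judge and expert preferences.
def judge_accuracy(judge_score, cases: list[dict]) -> float:
    """judge_score: fn(case, response) -> float. Each case holds two candidate
    responses and an expert label ("a" or "b") naming the preferred one."""
    hits = 0
    for case in cases:
        score_a = judge_score(case, case["response_a"])
        score_b = judge_score(case, case["response_b"])
        # The judge "agrees" when it ranks the expert-preferred response higher.
        predicted = "a" if score_a > score_b else "b"
        hits += predicted == case["expert_preference"]
    return hits / len(cases)
```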
- CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering [0.5310914438304387]
We introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural-language explanations in response to user queries. Experiments on our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an 18% improvement in diagnostic accuracy over the strongest baseline. A minimal pipeline sketch follows this entry.
arXiv Detail & Related papers (2025-08-25T19:22:16Z)
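A minimal sketch of the specialist-generalist split described above; `classify` and `explain` are placeholder callables standing in for the trained Specialist and Generalist components:

```python
# Hypothetical sketch: a small classifier commits to the diagnosis, and a
# conversational model only explains it. Both callables are placeholders.
def clarify_pipeline(classify, explain, image, question: str) -> dict:
    """classify: fn(image) -> (label, confidence); explain: fn(prompt) -> str."""
    label, confidence = classify(image)  # Specialist: fast, domain-trained
    prompt = (f"An image classifier predicts '{label}' (confidence {confidence:.2f}). "
              f"Answer the user's question in plain language: {question}")
    return {"diagnosis": label,
            "confidence": confidence,
            "explanation": explain(prompt)}  # Generalist: compressed VLM
```

Committing to the classifier's label before invoking the VLM keeps the diagnostic decision with the accurate, lightweight component.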
- Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision-Language Models (MedVLMs) by adjusting model structure, fine-tuning on high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training. A toy uncertainty-gated loop in this spirit is sketched below.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
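A toy sketch of uncertainty-gated expert intervention in the spirit of the summary above; the interfaces and threshold are invented, and this is not the paper's actual guidance mechanism:

```python
# Hypothetical sketch: route to a human expert only when the model's own
# uncertainty is high, then condition a second pass on the expert's hint.
def answer_with_expert_gate(model, expert_annotate, image, question,
                            threshold: float = 0.5):
    """model: fn(image, question, hint=None) -> (answer, uncertainty in [0, 1])."""
    answer, uncertainty = model(image, question)
    if uncertainty <= threshold:
        return answer  # confident case: no expert time spent
    # Uncertain case: ask the expert to highlight key evidence, then
    # re-answer conditioned on that annotation, with no retraining involved.
    hint = expert_annotate(image, question)
    revised_answer, _ = model(image, question, hint=hint)
    return revised_answer
```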
- LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. We present LLMEval-Med, a new benchmark covering five core medical areas, with 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z)
- MedBrowseComp: Benchmarking Medical Deep Research and Computer Use [10.565661515629412]
MedBrowseComp is a benchmark that systematically tests an agent's ability to retrieve and synthesize medical facts. It contains more than 1,000 human-curated questions that mirror clinical scenarios. Applying MedBrowseComp to frontier agentic systems reveals substantial shortfalls, with performance as low as ten percent. A toy browse-and-answer loop is sketched below.
arXiv Detail & Related papers (2025-05-20T22:42:33Z)
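A hedged sketch of the kind of browse-and-synthesize loop such a benchmark exercises; `llm`, `search`, and `read` are placeholder callables, not the benchmark's harness:

```python
# Hypothetical sketch: the agent iteratively searches, reads, and must
# commit to a short factual answer within a fixed step budget.
def browse_agent(llm, search, read, question: str, max_steps: int = 8) -> str:
    """llm: fn(prompt) -> str; search: fn(query) -> list[url]; read: fn(url) -> str."""
    notes: list[str] = []
    for _ in range(max_steps):
        action = llm(
            f"Question: {question}\nNotes so far: {notes}\n"
            "Reply 'SEARCH: <query>', 'READ: <url>', or 'ANSWER: <answer>'."
        )
        if action.startswith("ANSWER:"):
            return action.removeprefix("ANSWER:").strip()
        if action.startswith("SEARCH:"):
            notes.append(f"results: {search(action.removeprefix('SEARCH:').strip())}")
        elif action.startswith("READ:"):
            notes.append(read(action.removeprefix("READ:").strip()))
    return "no answer within budget"
```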
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking. A stage-wise scoring sketch follows this entry.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
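A stage-wise scoring sketch matching the three stages named above; the grading callables and case schema are placeholders, not MedR-Bench's actual evaluation code:

```python
# Hypothetical sketch: each case is graded separately on examination
# recommendation, diagnosis, and treatment planning, then averaged.
STAGES = ("examination", "diagnosis", "treatment")

def evaluate_case(model_answer, grade, case: dict) -> dict:
    """model_answer: fn(case, stage) -> str; grade: fn(stage, answer, reference) -> float."""
    results = {}
    for stage in STAGES:
        answer = model_answer(case, stage)
        # Compare against the annotated reasoning reference for this stage.
        results[stage] = grade(stage, answer, case["references"][stage])
    return results

def aggregate(per_case: list[dict]) -> dict:
    """Mean score per stage across the benchmark."""
    return {s: sum(r[s] for r in per_case) / len(per_case) for s in STAGES}
```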
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, as the player, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs. A toy doctor-NPC turn loop is sketched below.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
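A toy doctor-versus-NPC turn loop in the spirit of such a simulator; the agent interfaces and routing rule are invented for illustration, and the real framework's protocol will differ:

```python
# Hypothetical sketch of a multi-agent consultation: the doctor agent talks
# to patient/examiner NPCs until it commits to a diagnosis or runs out of turns.
def consult(doctor, patient_npc, examiner_npc, max_turns: int = 12) -> str:
    """Each agent is fn(transcript) -> utterance; the doctor may end the
    consultation with a line starting 'DIAGNOSIS:'."""
    transcript: list[str] = []
    for _ in range(max_turns):
        utterance = doctor(transcript)
        transcript.append(f"Doctor: {utterance}")
        if utterance.startswith("DIAGNOSIS:"):
            return utterance.removeprefix("DIAGNOSIS:").strip()
        # Route test orders to the examiner NPC, everything else to the patient.
        responder = examiner_npc if "order" in utterance.lower() else patient_npc
        transcript.append(f"NPC: {responder(transcript)}")
    return "undiagnosed"
```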
- Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models [29.05425041393475]
Generative Large Language Models (LLMs) hold significant promise in healthcare.
This study assessed the potential of LLMs to function as autonomous agents in a simulated tertiary care medical center.
arXiv Detail & Related papers (2024-01-05T15:09:57Z)