Related papers: GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

URL: http://arxiv.org/abs/2510.13734v1
Date: Wed, 15 Oct 2025 16:40:28 GMT
Title: GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians
Authors: Xiuyuan Chen, Tao Sun, Dexin Su, Ailing Yu, Junwei Liu, Zhe Chen, Gangzeng Jin, Xin Wang, Jingnan Liu, Hansong Xiao, Hualei Zhou, Dongjie Tao, Chunxiao Guo, Minghui Yang, Yuan Xia, Jing Zhao, Qianrui Fan, Yanyun Wang, Shuai Zhen, Kezhong Chen, Jun Wang, Zewen Sun, Heng Zhao, Tian Guan, Shaodong Wang, Geyun Chang, Jiaming Deng, Hongchengcheng Chen, Kexin Feng, Ruzhen Li, Jiayi Geng, Changtai Zhao, Jun Wang, Guihu Lin, Peihao Li, Liqi Liu, Peng Wei, Jian Wang, Jinjie Gu, Ping Wang, Fan Yang,
Abstract summary: Current benchmarks for AI clinician systems fail to capture the depth, robustness, and safety required for real-world clinical practice.<n>We introduce the GAPS framework, a multidimensional paradigm for evaluating textbfGrounding (cognitive depth), textbfAdequacy (answer completeness), textbfPerturbation (robustness), and textbfSafety.<n>We develop a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end.
Score: 32.33432636089606
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating \textbf{G}rounding (cognitive depth), \textbf{A}dequacy (answer completeness), \textbf{P}erturbation (robustness), and \textbf{S}afety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically-grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.

Related papers

Bridging the Knowledge-Action Gap by Evaluating LLMs in Dynamic Dental Clinical Scenarios [9.865786198063644]
The transition of Large Language Models (LLMs) from passive knowledge retrievers to autonomous clinical agents demands a shift in evaluation-from static accuracy to dynamic behavioral reliability.<n>This study empirically charts the capability boundaries of dental LLMs, providing a roadmap for bridging the gap between standardized knowledge and safe, autonomous clinical practice.
arXiv Detail & Related papers (2026-01-19T11:36:39Z)
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z)
Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation.<n>Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z)
MIRAGE: Misleading Retrieval-Augmented Generation via Black-box and Query-agnostic Poisoning Attacks [47.46936341268548]
Retrieval-Augmented Generation (RAG) systems introduce a critical attack surface: corpus poisoning.<n>We propose MIRAGE, a novel multi-stage poisoning pipeline designed for strict black-box and query-agnostic environments.<n>Extensive experiments demonstrate that MIRAGE significantly outperforms existing baselines in both attack efficacy and stealthiness.
arXiv Detail & Related papers (2025-12-09T06:38:16Z)
MindEval: Benchmarking Language Models on Multi-turn Mental Health Support [10.524387723320432]
MindEval is a framework for automatically evaluating language models in realistic, multi-turn mental health therapy conversations.<n>We quantitatively validate the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments.<n>We evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6 on average, with particular weaknesses in problematic AI-specific patterns of communication.
arXiv Detail & Related papers (2025-11-23T15:19:29Z)
Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox.<n>Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures.<n>We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z)
Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture [8.072932739333309]
We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap.<n>The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes.<n>A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus.
arXiv Detail & Related papers (2025-08-29T17:31:24Z)
CyberRAG: An Agentic RAG cyber attack classification and reporting tool [0.3914676152740142]
CyberRAG is a modular agent-based RAG framework that delivers real-time classification, explanation, and structured reporting for cyber-attacks.<n>Unlike traditional RAG, CyberRAG adopts an agentic design that enables dynamic control flow and adaptive reasoning.
arXiv Detail & Related papers (2025-07-03T08:32:19Z)
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations.<n>This approach was originally developed for the TREC Question Answering (QA) Track in 2003.<n>We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z)
Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
Efficient Lung Ultrasound Severity Scoring Using Dedicated Feature Extractor [12.280417624228544]
MeDiVLAD is a novel pipeline for multi-level lung-ultrasound severity scoring.<n>We leverage self-knowledge distillation to pretrain a vision transformer (ViT) without label and aggregate frame-level features.<n>We show that with minimal finetuning, MeDiVLAD outperforms conventional fully-supervised methods in both frame- and video-level scoring.
arXiv Detail & Related papers (2025-01-21T22:28:22Z)
ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems [0.0]
Large Language Models (LLMs) have shown impressive potential in clinical question answering.<n>RAG is emerging as a leading approach for ensuring the factual accuracy of model responses.<n>Current automated RAG metrics perform poorly in clinical and conversational use cases.
arXiv Detail & Related papers (2025-01-14T15:46:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.