Related papers: ART: Action-based Reasoning Task Benchmarking for Medical AI Agents

URL: http://arxiv.org/abs/2601.08988v1
Date: Tue, 13 Jan 2026 21:26:11 GMT
Title: ART: Action-based Reasoning Task Benchmarking for Medical AI Agents
Authors: Ananya Mantravadi, Shivali Dalmia, Abhishek Mukherji
Abstract summary: We introduce Action-based Reasoning clinical Task benchmark for medical AI agents. We identify three dominant error categories: retrieval failures, aggregation errors, and conditional logic misjudgments. Our four-stage pipeline produces diverse, clinically validated tasks grounded in real patient data.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reliable clinical decision support requires medical AI agents capable of safe, multi-step reasoning over structured electronic health records (EHRs). While large language models (LLMs) show promise in healthcare, existing benchmarks inadequately assess performance on action-based tasks involving threshold evaluation, temporal aggregation, and conditional logic. We introduce ART, an Action-based Reasoning clinical Task benchmark for medical AI agents, which mines real-world EHR data to create challenging tasks targeting known reasoning weaknesses. Through analysis of existing benchmarks, we identify three dominant error categories: retrieval failures, aggregation errors, and conditional logic misjudgments. Our four-stage pipeline (scenario identification, task generation, quality audit, and evaluation) produces diverse, clinically validated tasks grounded in real patient data. Evaluating GPT-4o-mini and Claude 3.5 Sonnet on 600 tasks shows near-perfect retrieval after prompt refinement, but substantial gaps in aggregation (28-64%) and threshold reasoning (32-38%). By exposing failure modes in action-oriented EHR reasoning, ART advances toward more reliable clinical agents, an essential step for AI systems that reduce cognitive load and administrative burden, supporting workforce capacity in high-demand care settings.

Related papers

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark [0.5066646435185324]
We investigate how human guidance of agentic AI can improve multimodal clinical prediction. We present our approach to three benchmark challenges: 30-day hospital prediction, emergency department cost forecasting, and discharge readiness assessment. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task.
arXiv Detail & Related papers (2026-02-23T04:37:45Z)
Strong Reasoning Isn't Enough: Evaluating Evidence Elicitation in Interactive Diagnosis [29.630872344186873]
Interactive medical consultation requires an agent to proactively elicit missing clinical evidence under uncertainty. Existing evaluations largely remain static or outcome-centric, neglecting the evidence-gathering process. We propose an interactive evaluation framework that explicitly models the consultation process using a simulated patient and a simulated reporter grounded in atomic evidences.
arXiv Detail & Related papers (2026-01-27T16:36:35Z)
AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning [73.50200033931148]
We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations.
arXiv Detail & Related papers (2026-01-23T11:59:13Z)
Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z)
Human-in-the-Loop Interactive Report Generation for Chronic Disease Adherence [17.904419827298074]
Chronic disease management requires regular adherence feedback to prevent avoidable hospitalizations. Manual authoring preserves clinical accuracy but does not scale; AI generation scales but can undermine trust in patient-facing contexts. We present a clinician-in-the-loop interface that constrains AI to data organization and preserves physician oversight through recognition-based review.
arXiv Detail & Related papers (2026-01-10T00:19:33Z)
MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI [66.0701326117134]
MedForget is a hierarchy-aware multimodal unlearning testbed for building compliant medical AI systems. We show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. We introduce a reconstruction attack that progressively adds hierarchical level context to prompts.
arXiv Detail & Related papers (2025-12-10T17:55:06Z)
DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services [49.70819009392778]
Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, multi-agent system for simulating realistic scenarios.
arXiv Detail & Related papers (2025-10-24T08:01:21Z)
Trainee Action Recognition through Interaction Analysis in CCATT Mixed-Reality Training [1.5641818606249476]
Critical Care Air Transport Team members must stabilize severely injured soldiers by managing ventilators, IV pumps, and suction devices during flight. Recent advances in simulation and multimodal data analytics enable more objective and comprehensive performance evaluation. This study examines how CCATT members are trained using mixed-reality simulations that replicate the high-pressure conditions of aeromedical evacuation.
arXiv Detail & Related papers (2025-09-22T15:19:45Z)
Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture [8.072932739333309]
We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap. The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes. A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus.
arXiv Detail & Related papers (2025-08-29T17:31:24Z)
OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments [2.567146936147657]
We introduce AgentClinic, a multimodal agent benchmark for evaluating large language models (LLMs) in simulated clinical environments. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy.
arXiv Detail & Related papers (2024-05-13T17:38:53Z)
Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals [4.799783526620609]
We released a catalogue of datasets and benchmarks pertaining to the broad domain of clinical and biomedical natural language processing (NLP). A total of 450 NLP datasets were manually systematized and annotated with rich metadata. Our analysis indicates that AI benchmarks of direct clinical relevance are scarce and fail to cover most work activities that clinicians want to see addressed.
arXiv Detail & Related papers (2022-01-18T15:05:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.