MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents
- URL: http://arxiv.org/abs/2501.14654v2
- Date: Wed, 12 Feb 2025 05:32:07 GMT
- Title: MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents
- Authors: Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, Jonathan H. Chen
- Abstract summary: Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents.
We introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts.
The environment is built on the standard APIs and communication infrastructure of modern EMR systems, so it can be easily migrated into live EMR systems.
- Score: 20.96732566767587
- Abstract: Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents, thereby surpassing their traditional role as chatbots. These agents can leverage their planning and tool-utilization capabilities to address tasks specified at a high level. However, a standardized dataset to benchmark the agent capabilities of LLMs in medical applications is currently lacking, making the evaluation of LLMs on complex tasks in interactive healthcare environments challenging. To address this gap, we introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts. MedAgentBench encompasses 300 patient-specific, clinically derived tasks from 10 categories written by human physicians, realistic profiles of 100 patients with over 700,000 data elements, a FHIR-compliant interactive environment, and an accompanying codebase. The environment uses the standard APIs and communication infrastructure of modern EMR systems, so it can be easily migrated into live EMR systems. MedAgentBench presents an unsaturated, agent-oriented benchmark on which current state-of-the-art LLMs show partial success. The best model (Claude 3.5 Sonnet v2) achieves a success rate of 69.67%. However, substantial room for improvement remains, giving the community a clear direction for further optimization. Furthermore, there is significant variation in performance across task categories. MedAgentBench is publicly available at https://github.com/stanfordmlgroup/MedAgentBench, offering a valuable framework for model developers to track progress and drive continuous improvements in the agent capabilities of large language models within the medical domain.
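Because the environment exposes standard FHIR REST endpoints, an agent's tool layer can be reduced to ordinary HTTP calls. Below is a minimal sketch in Python of what such calls might look like; the base URL, patient ID, LOINC code, and helper names are illustrative assumptions, not taken from the MedAgentBench codebase.

```python
# Minimal sketch of agent tool functions against a FHIR-compliant EHR server.
# FHIR_BASE and the example patient ID are hypothetical placeholders.
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # assumed local FHIR server


def get_patient(patient_id: str) -> dict:
    """Fetch a Patient resource via the standard FHIR REST read interaction."""
    resp = requests.get(
        f"{FHIR_BASE}/Patient/{patient_id}",
        headers={"Accept": "application/fhir+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def get_recent_observations(patient_id: str, loinc_code: str) -> list[dict]:
    """Search Observation resources (e.g., lab results) for a patient,
    newest first, using standard FHIR search parameters."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={
            "patient": patient_id,
            "code": loinc_code,
            "_sort": "-date",
            "_count": 5,
        },
        headers={"Accept": "application/fhir+json"},
        timeout=10,
    )
    resp.raise_for_status()
    bundle = resp.json()
    return [entry["resource"] for entry in bundle.get("entry", [])]


if __name__ == "__main__":
    patient = get_patient("example-patient-1")  # hypothetical ID
    print(patient.get("name"))
    # LOINC 2339-0 (blood glucose), shown purely as an illustration.
    for obs in get_recent_observations("example-patient-1", "2339-0"):
        print(obs.get("effectiveDateTime"), obs.get("valueQuantity"))
```

Since these are standard FHIR interactions rather than vendor-specific APIs, the same tool layer should in principle point at any FHIR-compliant server, which is the portability property the abstract highlights.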
Related papers
- Can Modern LLMs Act as Agent Cores in Radiology Environments? [54.36730060680139]
Large language models (LLMs) offer enhanced accuracy and interpretability across various domains.
This paper investigates the prerequisite questions for building concrete radiology agents.
First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents.
Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow.
arXiv Detail & Related papers (2024-12-12T18:20:16Z)
- SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents.
SPA-Bench offers three key contributions: a diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines.
A novel evaluation pipeline automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption.
arXiv Detail & Related papers (2024-10-19T17:28:48Z)
- Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents [40.86728610906313]
AXIS is a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over user interface actions.
Our experiments on Office Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining accuracy of 97%-98% compared to humans.
It also explores the possibility of turning every application into an agent, paving the way towards an agent-centric operating system (Agent OS).
arXiv Detail & Related papers (2024-09-25T17:58:08Z)
- MMedAgent: Learning to Use Medical Tools with Multi-modal Agent [27.314055140281432]
This paper introduces the first agent explicitly designed for the medical field, named Multi-modal Medical Agent (MMedAgent).
Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety of medical tasks compared to state-of-the-art open-source methods and even the closed-source model, GPT-4o.
arXiv Detail & Related papers (2024-07-02T17:58:23Z)
- AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation to build such agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z)
- Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents [19.721008909326024]
Large language models (LLMs) have sparked a new wave of technological revolution in medical artificial intelligence (AI).
We introduce a simulacrum of a hospital, called Agent Hospital, that simulates the entire process of treating illness.
Within the simulacrum, doctor agents are able to evolve by treating a large number of patient agents without the need to label training data manually.
arXiv Detail & Related papers (2024-05-05T14:53:51Z)
- Benchmarking Mobile Device Control Agents across Diverse Configurations [19.01954948183538]
B-MoCA is a benchmark for evaluating and developing mobile device control agents.
We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs.
While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness.
arXiv Detail & Related papers (2024-04-25T14:56:32Z)
- Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models [56.00992369295851]
Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks; however, they are still far inferior to API-based models when acting as agents.
This paper delivers three key observations: (1) the current agent training corpus is entangled with both format following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches introduce hallucinations as a side-effect of improving agent abilities.
We propose Agent-FLAN to effectively Fine-tune LANguage models for Agents.
arXiv Detail & Related papers (2024-03-19T16:26:10Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.