MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents
- URL: http://arxiv.org/abs/2506.07672v1
- Date: Mon, 09 Jun 2025 11:50:33 GMT
- Title: MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents
- Authors: Yunhe Yan, Shihe Wang, Jiajun Du, Yexuan Yang, Yuxuan Shan, Qichen Qiu, Xianqing Jia, Xinge Wang, Xin Yuan, Xu Han, Mao Qin, Yinxiao Chen, Chen Peng, Shangguang Wang, Mengwei Xu
- Abstract summary: We propose MCPWorld, the first automatic CUA testbed for API, GUI, and API-GUI hybrid agents. A key principle of MCPWorld is the use of "white-box apps", i.e., those whose source code is available and can be revised/recompiled as needed. MCPWorld includes 201 well-curated and annotated user tasks, covering diversified use cases and difficulty levels.
- Score: 14.736516215309768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: (M)LLM-powered computer use agents (CUA) are emerging as a transformative technique to automate human-computer interaction. However, existing CUA benchmarks predominantly target GUI agents, whose evaluation methods are susceptible to UI changes and ignore the function interactions exposed by application APIs, e.g., the Model Context Protocol (MCP). To this end, we propose MCPWorld, the first automatic CUA testbed for API, GUI, and API-GUI hybrid agents. A key principle of MCPWorld is the use of "white-box apps", i.e., those whose source code is available and can be revised/recompiled as needed (e.g., to add MCP support), with two notable advantages: (1) it greatly broadens the design space of CUAs, such as which app features are exposed/extracted as CUA-callable APIs and how; (2) it allows MCPWorld to programmatically verify task completion by directly monitoring application behavior through techniques like dynamic code instrumentation, offering robust, accurate CUA evaluation decoupled from specific agent implementations or UI states. Currently, MCPWorld includes 201 well-curated and annotated user tasks, covering diversified use cases and difficulty levels. MCPWorld is also fully containerized with GPU acceleration support for flexible adoption on different OS/hardware environments. Our preliminary experiments, using a representative LLM-powered CUA framework, achieve 75.12% task completion accuracy, simultaneously providing initial evidence on the practical effectiveness of agent automation leveraging MCP. Overall, we anticipate MCPWorld will facilitate and standardize the benchmarking of next-generation computer use agents that can leverage rich external tools. Our code and dataset are publicly available at https://github.com/SAAgent/MCPWorld.
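The white-box principle above implies that an app's internal functions can be surfaced directly as MCP tools for an agent to call, bypassing the GUI. The following sketch, which is illustrative and not taken from the MCPWorld codebase, shows how a hypothetical white-box to-do app might expose two such functions using the official MCP Python SDK (the `mcp` package); the app name, tool names, and in-memory state are all assumptions for the example.

```python
# Illustrative sketch only: exposing a white-box app's internal functions
# as CUA-callable MCP tools. The app ("todo-app"), its state, and the tool
# names are hypothetical; only the FastMCP API itself is real.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("todo-app")

# Stand-in for the app's internal data layer; in a real white-box app the
# tools would call the application's own functions after recompilation.
_tasks: list[str] = []

@mcp.tool()
def add_task(title: str) -> str:
    """Create a to-do item through the app's API instead of its GUI."""
    _tasks.append(title)
    return f"added: {title}"

@mcp.tool()
def list_tasks() -> list[str]:
    """Read app state directly, with no screenshot parsing required."""
    return list(_tasks)

if __name__ == "__main__":
    mcp.run()  # serve the tools over stdio to an MCP-capable agent
```

Because the same internal functions are reachable from test code, a harness could in principle verify task completion by instrumenting them (e.g., hooking add_task) rather than inspecting the UI, which is the decoupling from agent implementations and UI states that the abstract describes.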
Related papers
- LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? [50.60770039016318]
We present LiveMCPBench, the first comprehensive benchmark for evaluating Model Context Protocol (MCP) agents. LiveMCPBench consists of 95 real-world tasks grounded in the MCP ecosystem. Our evaluation covers 10 leading models, with the best-performing model reaching a 78.95% success rate.
arXiv Detail & Related papers (2025-08-03T14:36:42Z)
- InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction [35.285466934451904]
This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner. Unlike existing approaches that either build intricate workflows around a single large model or only provide modularity, our agent integrates tool-based and pure vision agents.
arXiv Detail & Related papers (2025-05-16T05:43:27Z)
- OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents [0.0]
OSUniverse is a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents. We divide the tasks into increasing levels of complexity, from basic precision clicking to multi-step, multi-application tests requiring dexterity, precision, and clear thinking from the agent. The benchmark can be scored manually, but we also introduce an automated validation mechanism with an average error rate of less than 2%.
arXiv Detail & Related papers (2025-05-06T14:29:47Z)
- UFO2: The Desktop AgentOS [60.317812905300336]
UFO2 is a multi-agent AgentOS for Windows desktops that elevates computer use agents (CUAs) into practical, system-level automation. We evaluate UFO2 across over 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.
arXiv Detail & Related papers (2025-04-20T13:04:43Z)
- PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent. From the perception perspective, we devise an Active Perception Module (APM) to overcome the limited ability of current MLLMs to perceive screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only [21.054681757006385]
We propose an agent that perceives its environment solely through screenshot images. By leveraging the reasoning capabilities of large language models, we eliminate the need for large-scale human demonstration data. The agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop.
arXiv Detail & Related papers (2024-06-11T05:21:20Z)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that facilitates LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with pass@1 rates of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z)
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches: comprehensive environment perception (CEP) and conditional action prediction (CAP).
With our technical design, our agent achieves new state-of-the-art performance on the AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)