OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
- URL: http://arxiv.org/abs/2510.24563v1
- Date: Tue, 28 Oct 2025 15:56:36 GMT
- Title: OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
- Authors: Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, Fei Huang,
- Abstract summary: We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities.
Rigorous manual validation yields 158 high-quality tools, each verified for correct functionality, practical applicability, and versatility.
OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments.
- Score: 49.34040731113563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates (only 36.3%), indicating room for improvement and highlighting the benchmark's challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.
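For readers unfamiliar with MCP: tools are exposed to agents over JSON-RPC 2.0, where a client first lists the tools a server advertises and then calls one by name with schema-validated arguments. The sketch below shows this wire format in Python; the tool name open_file and its arguments are hypothetical, not drawn from the OSWorld-MCP tool set.

```python
import json

# A descriptor a server might return from a "tools/list" request.
# The tool itself is hypothetical, not one of OSWorld-MCP's 158 tools.
tool_descriptor = {
    "name": "open_file",
    "description": "Open a file in the default application.",
    "inputSchema": {  # JSON Schema constraining the call's arguments
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

# The JSON-RPC 2.0 request an agent would send to invoke that tool,
# instead of performing the equivalent multi-step GUI interaction.
call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "open_file",
        "arguments": {"path": "/home/user/report.odt"},
    },
}

print(json.dumps(call_request, indent=2))
```

Benchmarks like OSWorld-MCP measure whether an agent chooses such a call when one is available and falls back to GUI operation when it is not.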
Related papers
- MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers [5.463884405989425]
We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency.
It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step orchestration.
We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model's final answer.
arXiv Detail & Related papers (2026-01-31T23:19:39Z)
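As a rough illustration of MCP-Atlas's claims-based scoring idea: a minimal, unweighted sketch in Python, assuming equal credit per claim (the actual rubric may weight claims differently).

```python
def partial_credit(claims_satisfied: int, claims_total: int) -> float:
    """Fraction of rubric claims an answer satisfies.

    Unweighted assumption for illustration: an answer meeting 3 of 4
    factual claims scores 0.75, where a binary metric would give 0.0.
    """
    if claims_total <= 0:
        raise ValueError("rubric must contain at least one claim")
    return claims_satisfied / claims_total
```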
- MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use [12.220519951554133]
MCPAgentBench is a benchmark built on real-world MCP definitions to evaluate the tool-use capabilities of agents.
The evaluation employs a dynamic sandbox environment that presents agents with candidate tool lists containing distractors.
Experiments on the latest mainstream Large Language Models reveal significant performance differences in handling complex, multi-step tool invocations.
arXiv Detail & Related papers (2025-12-31T02:09:48Z)
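A minimal sketch of how a candidate list with distractors could be assembled, assuming random sampling from a larger tool pool (the helper below is hypothetical, not MCPAgentBench's actual procedure):

```python
import random

def build_candidate_list(required_tools, tool_pool, n_distractors=5, seed=0):
    """Mix the tools a task actually needs with sampled distractors,
    then shuffle, so the agent must select among plausible options.

    Hypothetical sketch; the pool size and sampling scheme are assumptions.
    """
    rng = random.Random(seed)
    pool = [t for t in tool_pool if t not in required_tools]
    distractors = rng.sample(pool, min(n_distractors, len(pool)))
    candidates = list(required_tools) + distractors
    rng.shuffle(candidates)
    return candidates
```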
- MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning [55.221850286246]
We introduce MindWatcher, a tool-integrated reasoning agent with interleaved thinking and multimodal chain-of-thought (CoT) reasoning.
MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use.
A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition.
arXiv Detail & Related papers (2025-12-29T12:16:12Z)
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution [86.4588675093384]
Toolathlon is a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation.
The benchmark includes 108 manually sourced or crafted tasks, requiring interaction with multiple Apps over roughly 20 turns on average to complete.
We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
arXiv Detail & Related papers (2025-10-29T17:32:49Z)
- TheMCPCompany: Creating General-purpose Agents with Task-specific Tools [12.249551019598442]
We introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services.
We also provide manually annotated ground-truth tools for each task.
Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but struggle to navigate complex enterprise environments.
arXiv Detail & Related papers (2025-10-22T06:42:01Z)
- MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use [72.53177559476704]
We introduce MCPVerse, a real-world benchmark for evaluating agentic tool use.
MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens.
We benchmark state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale).
arXiv Detail & Related papers (2025-08-22T09:47:53Z) - MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs on realistic and challenging tasks through interaction with real-world MCP servers.
Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching.
We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%), and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z) - LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? [50.60770039016318]
We present LiveMCPBench, the first comprehensive benchmark for evaluating Model Context Protocol (MCP) agents.
LiveMCPBench consists of 95 real-world tasks grounded in the MCP ecosystem.
Our evaluation covers 10 leading models, with the best-performing model reaching a 78.95% success rate.
arXiv Detail & Related papers (2025-08-03T14:36:42Z) - MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models [33.250579401886206]
This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate the performance of Large Language Models (LLMs) within the Model Context Protocol (MCP) framework.
MCP-RADAR features a challenging dataset of 507 tasks spanning six domains: mathematical reasoning, web search, email, calendar, file management, and terminal operations.
Unlike traditional benchmarks that rely on subjective human evaluation or binary success metrics, MCP-RADAR adopts objective, quantifiable measurements across multiple task domains.
arXiv Detail & Related papers (2025-05-22T14:02:37Z) - Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks.
Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use.
We propose a framework that encourages models to produce accurate answers with minimal tool calls.
Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z)
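A minimal sketch of the underlying idea of rewarding correctness while charging for tool use; the linear form and penalty coefficient are assumptions for illustration, not the paper's actual training objective:

```python
def efficiency_reward(correct: bool, tool_calls: int, penalty: float = 0.1) -> float:
    """Hypothetical reward: full credit for a correct answer minus a
    small cost per external tool call, so a correct answer reached
    with 2 calls (0.8) outscores one reached with 7 (0.3)."""
    return (1.0 if correct else 0.0) - penalty * tool_calls
```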
- MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation [25.360660222418183]
We present MetaTool, a novel tool learning methodology designed to generalize across any reusable toolset.
By incorporating meta-task data into task-oriented training, our method significantly enhances the performance of open-source Large Language Models.
arXiv Detail & Related papers (2024-07-15T10:15:41Z)