Fugu-MT Paper Translation (Abstract): MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Paper overview: MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

arxiv url: http://arxiv.org/abs/2508.20453v1
Date: Thu, 28 Aug 2025 05:58:57 GMT
Status: Translation complete
Last updated in system: 2025-08-29 18:12:02.059796
Title: MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Title (reference translation): MCP-Bench: Benchmarking tool-using LLM agents on complex real-world tasks via MCP servers
Authors: Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow,
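Abstract summary: MCP-Bench is a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search.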
Reference score (independently computed attention metric): 24.6512259539754
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
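Abstract (reference translation): We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs). Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows. These capabilities are not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.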

Related papers