Fugu-MT Paper Translation (Abstract): MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

Paper summary: MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

arxiv url: http://arxiv.org/abs/2508.07575v1
Date: Mon, 11 Aug 2025 03:16:02 GMT
Status: Translation complete
Last updated in system: 2025-08-12 21:23:28.92041
Title: MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
Title (reference translation): MCPToolBench++: A Large-Scale AI Agent Model Context Protocol (MCP) Tool Use Benchmark
Authors: Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo
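Abstract summary: The Model Context Protocol (MCP) provides a standardized way to supply context to AI agents. Evaluating the MCP tool-use abilities of LLMs and AI agents suffers from several issues. We propose MCPToolBench++, a large-scale, multi-domain AI agent tool-use benchmark.
Reference score (independently computed attention score): 6.470909719300937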
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs' capabilities are enhanced by using function calls to integrate various data sources or API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser usage. Integrating these data sources or functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, the evaluation of LLMs' and AI Agents' MCP tool use abilities suffers from several issues. First, there is a lack of comprehensive datasets or benchmarks to evaluate various MCP tools. Second, the diverse response formats of MCP tool call execution further increase the difficulty of evaluation. Additionally, unlike existing tool-use benchmarks with high success rates in functions such as programming and math functions, the success rate of real-world MCP tools is not guaranteed and varies across different MCP servers. Furthermore, the LLM's context window limits the number of available tools that can be called in a single run, because the textual descriptions of the tools and their parameters are too long in tokens for an LLM to process all at once. To help address the challenges of evaluating LLMs' performance on calling MCP tools, we propose MCPToolBench++, a large-scale, multi-domain AI Agent tool use benchmark. As of July 2025, this benchmark is built upon a marketplace of over 4k MCP servers from more than 40 categories, collected from the MCP marketplaces and GitHub communities. The datasets consist of both single-step and multi-step tool calls across different categories. We evaluated SOTA LLMs with agentic abilities on this benchmark and reported the results.
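Abstract (reference translation): The capabilities of LLMs are enhanced by using function calls to integrate various data sources and API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser use. Integrating these data sources and functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, evaluating the MCP tool-use abilities of LLMs and AI agents faces several problems. First, comprehensive datasets and benchmarks for evaluating the various MCP tools are lacking. Second, the diversity of response formats from MCP tool call execution makes evaluation harder still. In addition, unlike existing tool-use benchmarks that achieve high success rates on functions such as programming and math functions, the success rate of real-world MCP tools is not guaranteed and varies across MCP servers. Furthermore, the LLM's context window limits the number of tools that can be called in a single run, because the textual descriptions of the tools and their parameters take up too many tokens for the LLM to process all at once. To address the challenges of evaluating LLM performance on calling MCP tools, we propose MCPToolBench++, a large-scale, multi-domain AI agent tool-use benchmark. As of July 2025, the benchmark is built on a marketplace of more than 4k MCP servers from more than 40 categories, collected from MCP marketplaces and GitHub communities. The datasets consist of single-step and multi-step tool calls across different categories. We evaluated SOTA LLMs with agentic abilities on this benchmark and report the results.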

Related papers