Fugu-MT Paper Translation (Abstract): MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

Paper overview: MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

arxiv url: http://arxiv.org/abs/2505.16700v1
Date: Thu, 22 May 2025 14:02:37 GMT
Status: Translation complete
Last updated in system: 2025-05-23 17:12:48.345915
Title: MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models
Title (reference translation): MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool-Use Capabilities in Large Language Models
Authors: Xuanqi Gao, Siyi Xie, Juan Zhai, Shiqing Ma, Chao Shen
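Abstract summary: This paper introduces MCP-RADAR, the first comprehensive benchmark designed to evaluate the performance of large language models (LLMs) within the Model Context Protocol (MCP) framework. Unlike conventional benchmarks that rely on subjective human evaluation or binary success metrics, MCP-RADAR uses objective, quantified measurements across multiple task domains.
Reference score (independently computed attention score): 11.809732662992982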
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of tool interaction, the Model Context Protocol (MCP) has emerged as a standardized framework for dynamic tool discovery and orchestration. Despite widespread industry adoption, existing evaluation methodologies fail to adequately assess tool utilization capabilities within this new paradigm. This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance in the MCP framework through a novel five-dimensional approach measuring: answer accuracy, tool selection efficiency, computational resource efficiency, parameter construction accuracy, and execution speed. Unlike conventional benchmarks that rely on subjective human evaluations or binary success metrics, MCP-RADAR employs objective, quantifiable measurements across multiple task domains including software engineering, mathematical reasoning, and general problem-solving. Our evaluations of leading commercial and open-source LLMs reveal distinctive capability profiles with significant trade-offs between accuracy, efficiency, and speed, challenging traditional single-metric performance rankings. Besides, we provide valuable guidance for developers to optimize their tools for maximum model compatibility and effectiveness. While focused on MCP due to its standardized approach, our methodology remains applicable across all LLM agent tool integration frameworks, providing valuable insights for both LLM developers and tool creators to optimize the entire LLM-tool interaction ecosystem. The implementation, configurations, and datasets used in our evaluation are publicly available at https://anonymous.4open.science/r/MCPRadar-B143.
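Abstract (reference translation): As large language models (LLMs) evolve from passive text generators into active reasoning agents capable of tool interaction, the Model Context Protocol (MCP) has emerged as a standardized framework for dynamic tool discovery and orchestration. Despite wide industry adoption, existing evaluation methods fail to adequately assess tool-use capabilities in this new paradigm. This paper introduces MCP-RADAR, the first comprehensive benchmark designed to evaluate LLM performance in the MCP framework, using a five-dimensional approach that measures answer accuracy, tool selection efficiency, computational resource efficiency, parameter construction accuracy, and execution speed. Unlike conventional benchmarks that rely on subjective human evaluation or binary success metrics, MCP-RADAR uses objective, quantified measurements across multiple task domains, including software engineering, mathematical reasoning, and general problem-solving. Evaluations of leading commercial and open-source LLMs reveal distinctive capability profiles with significant trade-offs between accuracy, efficiency, and speed, challenging traditional single-metric performance rankings. The paper also offers guidance to help developers optimize their tools for maximum model compatibility and effectiveness. Although it focuses on MCP because of its standardized approach, the methodology is applicable to all LLM agent tool integration frameworks, giving both LLM developers and tool creators insights for optimizing the entire LLM-tool interaction ecosystem. The implementation, configurations, and datasets used in the evaluation are publicly available at https://anonymous.4open.science/r/MCPRadar-B143.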

Related papers