Fugu-MT Paper Translation (Summary): GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

Paper summary: GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

arxiv url: http://arxiv.org/abs/2604.15715v1
Date: Fri, 17 Apr 2026 05:36:00 GMT
Status: Translation complete
System update date: 2026-04-20 22:00:19.75218
Title: GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
Title (reference translation): GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
Authors: Jize Wang, Xuanxuan Liu, Yining Li, Songyang Zhang, Yijun Wang, Zifei Shan, Xinyi Le, Cailian Chen, Xinping Guan, Dacheng Tao
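Abstract summary: GTA-2 is a hierarchical benchmark for General Tool Agents (GTA). Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. In experiments, frontier models already struggle on atomic tasks, and on workflows the top models achieve only 14.39% success.
Reference score (independently computed attention score): 90.35728421223673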
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.
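Abstract (reference translation): Developing general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements: they rely on AI-generated queries, dummy tools, and limited system-level coordination. We therefore propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from the earlier GTA benchmark, evaluates the precision of short-horizon, closed-ended tool use. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (execution harnesses). Experiments show that frontier models already struggle on atomic tasks (below 50%) and largely fail on workflows, where the top models achieve only 14.39% success. Checkpoint-guided feedback improves performance, and advanced frameworks such as Manus and OpenClaw substantially improve workflow completion, highlighting the importance of execution harness design beyond the capacity of the underlying model. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.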
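
The abstract describes the recursive checkpoint-based evaluation only at a high level: objectives are decomposed into verifiable sub-goals. As a rough illustration of that idea, and not the paper's actual method, the sketch below assumes a tree of checkpoints in which each leaf is a pass/fail sub-goal and each parent scores as the unweighted mean of its children. The `Checkpoint` class, the `score` function, the averaging rule, and the example task are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Checkpoint:
    """A node in a hypothetical checkpoint tree (not the paper's schema)."""
    name: str
    passed: Optional[bool] = None  # only meaningful on leaves (verifiable sub-goals)
    children: List["Checkpoint"] = field(default_factory=list)


def score(node: Checkpoint) -> float:
    """Score a checkpoint recursively.

    Assumption: a leaf scores 1.0 if it passed and 0.0 otherwise, and a
    parent scores as the unweighted mean of its children.
    """
    if not node.children:
        return 1.0 if node.passed else 0.0
    return sum(score(child) for child in node.children) / len(node.children)


# Made-up workflow task decomposed into sub-goals.
task = Checkpoint("produce report", children=[
    Checkpoint("collect data", children=[
        Checkpoint("file downloaded", passed=True),
        Checkpoint("table parsed", passed=False),
    ]),
    Checkpoint("write summary", passed=True),
])

task_score = score(task)
print(task_score)  # 0.75
```

Under these assumptions a partially completed workflow receives partial credit (here 0.75) rather than a flat failure. A real evaluator could weight checkpoints differently or require every sub-goal to pass before counting a task as a success.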

Related papers