Fugu-MT Paper Translation (Summary): ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation

Paper summary: ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation

arxiv url: http://arxiv.org/abs/2601.06328v1
Date: Fri, 09 Jan 2026 21:59:31 GMT
Status: Translation complete
Last updated in system: 2026-01-13 19:08:00.752144
Title: ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation
Title (reference translation): ToolGym: an open-world tool-using environment for scalable agent testing and data curation
Authors: Ziqiao Xi, Shuang Liang, Qi Liu, Jiaqing Zhang, Letian Peng, Fang Nan, Meshal Nayim, Tianhui Zhang, Rishika Mundada, Lianhui Qin, Biwei Huang, Kun Zhou
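Abstract summary: We introduce an open-world tool-using environment built on 5,571 format-unified tools across 204 commonly used apps. It includes a task creation engine that synthesizes long-horizon, multi-tool workflows with wild constraints, and a state controller that injects interruptions and failures to stress-test robustness. Comprehensive evaluation of state-of-the-art LLMs reveals a misalignment between tool planning and execution abilities, the weak constraint following of existing LLMs, and that DeepSeek-v3.2 is the most robust.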
Reference score (independently computed attention score): 42.479399507055454
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tool-using LLM agents still struggle in open-world settings with large tool pools, long-horizon objectives, wild constraints, and unreliable tool states. For scalable and realistic training and testing, we introduce an open-world tool-using environment, built on 5,571 format unified tools across 204 commonly used apps. It includes a task creation engine that synthesizes long-horizon, multi-tool workflows with wild constraints, and a state controller that injects interruptions and failures to stress-test robustness. On top of this environment, we develop a tool select-then-execute agent framework with a planner-actor decomposition to separate deliberate reasoning and self-correction from step-wise execution. Comprehensive evaluation of state-of-the-art LLMs reveals the misalignment between tool planning and execution abilities, the constraint following weakness of existing LLMs, and DeepSeek-v3.2's strongest robustness. Finally, we collect 1,170 trajectories from our environment to fine-tune LLMs, achieving superior performance to baselines using 119k samples, indicating the environment's value as both a realistic benchmark and a data engine for tool-using agents. Our code and data will be publicly released.
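Abstract (reference translation): Tool-using LLM agents still struggle in open-world settings with large tool pools, long-horizon objectives, wild constraints, and unreliable tool states. For scalable and realistic training and testing, we introduce an open-world tool-using environment built on 5,571 format-unified tools across 204 commonly used apps. It includes a task creation engine that synthesizes long-horizon, multi-tool workflows with wild constraints, and a state controller that injects interruptions and failures to stress-test robustness. On top of this environment, we develop a tool select-then-execute agent framework with a planner-actor decomposition that separates deliberate reasoning and self-correction from step-wise execution. Comprehensive evaluation of state-of-the-art LLMs reveals a misalignment between tool planning and execution abilities, the weak constraint following of existing LLMs, and that DeepSeek-v3.2 is the most robust. Finally, we collect 1,170 trajectories from the environment to fine-tune LLMs, achieving performance superior to baselines that use 119k samples, which indicates the environment's value as both a realistic benchmark and a data engine for tool-using agents. The code and data will be publicly released.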
