Fugu-MT Paper Translation (Abstract): Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent

Paper overview: Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent

arxiv url: http://arxiv.org/abs/2603.05578v1
Date: Thu, 05 Mar 2026 17:44:29 GMT
Status: Translation complete
System update date: 2026-03-09 13:17:44.290203
Title: Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent
Authors: Bowei Xia, Mengkang Hu, Shijian Wang, Jiarui Jin, Wenxiang Jiao, Yuan Lu, Kexin Li, Ping Luo
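Abstract summary: Tool-Genesis is a diagnostic benchmark designed to quantify agent capabilities across multiple dimensions. Even state-of-the-art models struggle to produce precise tool interfaces or executable logic in a one-shot setting.
Reference score (independently computed attention metric): 45.450766613995135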
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Research on self-evolving language agents has accelerated, drawing increasing attention to their ability to create, adapt, and maintain tools from task requirements. However, existing benchmarks predominantly rely on predefined specifications, which limits scalability and hinders truly autonomous evolution. While recent studies attempt to dynamically generate tools, they primarily emphasize downstream performance, resulting in a "black-box" evaluation that makes it difficult to attribute failures to specific causes. To address this, we propose Tool-Genesis, a diagnostic benchmark designed to quantify agent capabilities across multiple dimensions, including interface compliance, functional correctness, and downstream utility. Tool-Genesis evaluates whether agents can construct task-relevant tools solely from abstract requirements (without preset specifications) and use them to solve realistic problems. Crucially, we find that even state-of-the-art models struggle to produce precise tool interfaces or executable logic in a one-shot setting. These minor initial flaws are amplified through the pipeline, leading to a sharp degradation in downstream metrics. We hope Tool-Genesis will guide future research toward training and steering models to synthesize persistent, general-purpose tools that better address real-world challenges.

Related papers