Fugu-MT Paper Translation (Summary): Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Paper summary: Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

arxiv url: http://arxiv.org/abs/2510.02837v1
Date: Fri, 03 Oct 2025 09:19:15 GMT
Status: Translation complete
Last updated in system: 2025-10-06 16:35:52.334026
Title: Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
Title (reference translation): Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
Authors: Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park
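Abstract summary: Properly evaluating an agent's performance requires going beyond the final answer to also assess the problem-solving trajectory. The paper introduces TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner.
Reference score (attention score computed independently by this site): 22.781523439717223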
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although recent tool-augmented benchmarks incorporate complex user requests and diverse tools, the evaluation methods for most of them remain limited to answer matching. However, as the number of steps required to resolve a user request increases, a proper evaluation of an agent's performance must go beyond the final answer to also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The most straightforward method for evaluating these aspects is to compare an agent's trajectory with the ground-truth trajectory, but this approach is fundamentally limited since annotating all valid ground-truth trajectories is prohibitively expensive. However, a simple LLM-based evaluator struggles to assess trajectories in detail without ground truth. To effectively evaluate the agents in this manner, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence bank, which accumulates knowledge gathered from preceding reasoning steps, TRACE enables a multi-faceted analysis and evaluation of an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.
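Abstract (reference translation): Recent tool-augmented benchmarks incorporate complex user requests and diverse tools, yet most of them still evaluate agents only by answer matching. As the number of steps needed to resolve a user request grows, a proper evaluation of agent performance must go beyond the final answer and also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The simplest way to evaluate these aspects is to compare the agent's trajectory with a ground-truth trajectory, but this approach is fundamentally limited because annotating every valid ground-truth trajectory is prohibitively expensive. A simple LLM-based evaluator, on the other hand, struggles to assess trajectories in detail without ground truth. To evaluate agents effectively in this setting, the authors introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence bank, which accumulates the knowledge gathered from preceding reasoning steps, TRACE can analyze and evaluate an agent's reasoning trajectory from multiple angles. To validate the framework, the authors build a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. The results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs. The method is then applied to the trajectories that agents produce while solving tool-augmented tasks, yielding previously unreported observations and corresponding insights.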
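The abstract does not say how the evidence bank is implemented, so the following is only a toy sketch of the general idea, not the authors' method. It assumes a trajectory is a list of steps, that the bank is simply the set of tool observations seen so far, and that a plain substring check stands in for the LLM judge. The names `Step` and `evaluate_trajectory`, and both score definitions, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a hypothetical agent trajectory (illustrative only)."""
    tool: str                 # name of the tool the agent called
    args: tuple               # arguments passed to the tool
    observation: str          # text the tool returned
    claims: list = field(default_factory=list)  # facts the agent asserts after this step


def evaluate_trajectory(steps):
    """Score a trajectory without a ground-truth trajectory.

    Assumption: a claim counts as supported if it appears verbatim in any
    observation collected so far. A real evaluator would ask an LLM instead.
    """
    evidence_bank = set()   # knowledge accumulated from preceding steps
    seen_calls = set()      # (tool, args) pairs already issued
    redundant = 0
    unsupported = 0
    total_claims = 0

    for step in steps:
        call = (step.tool, step.args)
        if call in seen_calls:
            redundant += 1  # repeated identical call: counted as inefficiency
        seen_calls.add(call)

        evidence_bank.add(step.observation)

        for claim in step.claims:
            total_claims += 1
            if not any(claim in evidence for evidence in evidence_bank):
                unsupported += 1  # nothing in the bank backs this claim

    n = len(steps)
    return {
        "efficiency": 1 - redundant / n if n else 1.0,
        "hallucination_rate": unsupported / total_claims if total_claims else 0.0,
    }


if __name__ == "__main__":
    trajectory = [
        Step("search", ("weather Seoul",), "Seoul: 18C, cloudy", ["18C"]),
        Step("search", ("weather Seoul",), "Seoul: 18C, cloudy", ["sunny"]),
    ]
    print(evaluate_trajectory(trajectory))
```

In this sketch the second step repeats the first call and asserts something no observation supports, so both scores are penalized. Adaptivity, the third aspect named in the abstract, is left out because the abstract gives no hint of how it would be measured.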

Related papers