Fugu-MT paper translation (abstract): Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

Paper overview: Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

arxiv url: http://arxiv.org/abs/2604.27233v1
Date: Wed, 29 Apr 2026 22:09:47 GMT
Status: translation complete
Last updated in system: 2026-05-01 16:31:53.824034
Title: Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents
Title (reference translation): Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents
Authors: Anh Ta, Junjie Zhu, Shahin Shayandeh
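Abstract summary: The authors introduce Helpfulness-Harmfulness metrics to measure the tradeoff between the errors that reviewer feedback corrects and the errors it introduces. They evaluate the approach on BFCL and Tau2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. As the reviewer model, the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o.
Reference score (independently computed attention score): 6.158612515104146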
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls prior to execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation. In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce Helpfulness-Harmfulness metrics: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value. We evaluate our approach on BFCL (single-turn) and Tau2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5-2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.
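Abstract (reference translation): Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, but LLM trajectory assessments remain inherently post-hoc. Because they are disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and they cannot course-correct the agent in real time. To close this gap, the authors move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls before execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation. In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet to the authors' knowledge no prior work has systematically measured this tradeoff. To quantify it, they introduce Helpfulness-Harmfulness metrics: helpfulness measures the percentage of base agent errors that feedback corrects, and harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value. The approach is evaluated on BFCL (single-turn) and Tau2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. The metrics show that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5-2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.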
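
The two metrics defined in the abstract can be illustrated with a minimal Python sketch. It assumes each evaluated example is recorded as a pair of booleans (was the base agent correct on its own, and was the result correct after reviewer feedback). The `Outcome` type, the function names, and the sample data are hypothetical and not taken from the paper. The abstract does not state how the benefit-to-risk ratio is computed, so the sketch stops at the two percentages and the resulting net accuracy change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One evaluated example (hypothetical record layout)."""
    base_correct: bool      # base agent alone produced a correct response
    reviewed_correct: bool  # response was correct after reviewer feedback


def helpfulness(outcomes):
    """Fraction of base-agent errors that feedback corrects (0.0 if none)."""
    errors = [o for o in outcomes if not o.base_correct]
    if not errors:
        return 0.0
    return sum(o.reviewed_correct for o in errors) / len(errors)


def harmfulness(outcomes):
    """Fraction of correct base responses that feedback degrades (0.0 if none)."""
    correct = [o for o in outcomes if o.base_correct]
    if not correct:
        return 0.0
    return sum(not o.reviewed_correct for o in correct) / len(correct)


def net_accuracy_change(outcomes):
    """Accuracy with feedback minus accuracy without it."""
    if not outcomes:
        return 0.0
    fixed = sum(1 for o in outcomes if not o.base_correct and o.reviewed_correct)
    broken = sum(1 for o in outcomes if o.base_correct and not o.reviewed_correct)
    return (fixed - broken) / len(outcomes)


if __name__ == "__main__":
    # Made-up data: 4 base errors (3 fixed by feedback),
    # 6 correct base responses (1 degraded by feedback).
    sample = (
        [Outcome(False, True)] * 3
        + [Outcome(False, False)]
        + [Outcome(True, True)] * 5
        + [Outcome(True, False)]
    )
    print(helpfulness(sample))          # 0.75
    print(harmfulness(sample))
    print(net_accuracy_change(sample))
```

Because helpfulness is normalized by the number of base errors and harmfulness by the number of correct base responses, a reviewer with high helpfulness can still lower overall accuracy when the base agent is already mostly correct. The net accuracy change makes that visible.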

Related papers