Fugu-MT Paper Translation (Summary): RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

Paper summary: RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

arxiv url: http://arxiv.org/abs/2603.08561v3
Date: Thu, 12 Mar 2026 11:31:58 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.441689
Title: RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback
Title (reference translation): RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback
Authors: Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao,
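Abstract summary: RetroAgent is an online RL framework that enables agents to master complex interactive environments. Experiments show that RetroAgent achieves state-of-the-art (SOTA) performance.
Reference score (independently computed attention score): 54.39884046754265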
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Standard reinforcement learning (RL) for large language model (LLM)-based agents typically optimizes extrinsic task-success rewards, prioritizing one-off task solving over continual adaptation. As a result, agents may converge to suboptimal policies due to limited exploration, and accumulated experience remains implicitly stored in model parameters, hindering efficient experiential learning. Inspired by humans' capacity for retrospective self-improvement, we introduce RetroAgent, an online RL framework that enables agents to master complex interactive environments not only by solving, but also by evolving under the joint guidance of extrinsic task-success rewards and retrospective dual intrinsic feedback. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces: (1) intrinsic numerical feedback, which tracks incremental subtask completion relative to prior attempts to reward promising exploration; and (2) intrinsic language feedback, which distills reusable lessons into a memory buffer retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy, jointly balancing relevance, utility, and exploration. Extensive experiments across four challenging agentic tasks show that RetroAgent achieves state-of-the-art (SOTA) performance, substantially outperforming RL fine-tuning, memory-augmented RL, exploration-guided RL, and meta-RL methods -- e.g., exceeding Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper -- while maintaining strong test-time adaptation and out-of-distribution generalization.
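Abstract (reference translation): Standard reinforcement learning (RL) for agents based on large language models (LLMs) typically optimizes extrinsic task-success rewards, prioritizing one-off task solving over continual adaptation. As a result, agents may converge to suboptimal policies because of limited exploration, and accumulated experience is stored only implicitly in model parameters, which hinders efficient experiential learning. Inspired by the human capacity for retrospective self-improvement, RetroAgent is an online RL framework that enables agents to master complex interactive environments. Specifically, RetroAgent produces (1) intrinsic numerical feedback, which tracks incremental subtask completion relative to prior attempts in order to reward promising exploration, and (2) intrinsic language feedback, which distills reusable lessons into a memory buffer that is retrieved via the proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy. RetroAgent achieves state-of-the-art (SOTA) performance, substantially outperforming RL fine-tuning, memory-augmented RL, exploration-guided RL, and meta-RL methods, for example exceeding GRPO-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper, while maintaining strong test-time adaptation and out-of-distribution generalization.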
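
The abstract names two mechanisms without giving formulas: a numerical signal that rewards progress over prior attempts, and a memory retrieval rule that balances relevance, utility, and exploration. The sketch below is one possible reading, not the paper's method. It uses a generic UCB-style score (weighted similarity plus weighted utility plus an exploration bonus), and every name, weight, and formula in it (`Lesson`, `ucb_score`, `retrieve`, `progress_reward`, `w_sim`, `w_util`, `c`) is a hypothetical choice made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Lesson:
    """One entry in a hypothetical lesson memory (all fields are assumptions)."""
    text: str
    embedding: List[float]   # assumed vector representation of the lesson
    utility: float = 0.0     # assumed running estimate of how much the lesson helped
    uses: int = 0            # number of times this lesson has been retrieved


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ucb_score(lesson: Lesson, query: List[float], total_uses: int,
              w_sim: float = 1.0, w_util: float = 1.0, c: float = 0.5) -> float:
    """Assumed scoring rule: similarity + utility + UCB-style exploration bonus."""
    bonus = c * math.sqrt(math.log(total_uses + 1) / (lesson.uses + 1))
    return w_sim * cosine(lesson.embedding, query) + w_util * lesson.utility + bonus


def retrieve(memory: List[Lesson], query: List[float], k: int = 1) -> List[Lesson]:
    """Return the k highest-scoring lessons and increment their use counts."""
    total = sum(lesson.uses for lesson in memory)
    ranked = sorted(memory, key=lambda l: ucb_score(l, query, total), reverse=True)
    chosen = ranked[:k]
    for lesson in chosen:
        lesson.uses += 1
    return chosen


def progress_reward(completed_now: float, best_before: float) -> float:
    """Assumed intrinsic reward: only improvement over the best prior attempt counts."""
    return max(0.0, completed_now - best_before)
```

With this scoring rule, a rarely retrieved lesson gets a larger bonus than a frequently retrieved one of equal similarity and utility, which is the usual way a UCB term encourages exploration. The relative weights would need tuning in any real use.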

Related papers