Fugu-MT Paper Translation (Abstract): Co-Evolution of Policy and Internal Reward for Language Agents

Paper overview: Co-Evolution of Policy and Internal Reward for Language Agents

arxiv url: http://arxiv.org/abs/2604.03098v1
Date: Fri, 03 Apr 2026 15:21:11 GMT
Status: Translation complete
Last updated in system: 2026-04-06 17:20:24.508012
Title: Co-Evolution of Policy and Internal Reward for Language Agents
Title (reference translation): Co-Evolution of Policy and Internal Reward for Language Agents
Authors: Xinyu Wang, Hanwei Wu, Jingwei Song, Shuyuan Zhang, Jiayi Zhang, Fanqi Kong, Tung Sum Thomas Kwok, Xiao-Wen Chang, Yuyu Luo, Chenglin Wu, Bang Liu
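Abstract summary: Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains bottlenecked by sparse and delayed rewards. This paper proposes Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision.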
Reference score (independently computed attention level): 37.41307226473692
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
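Abstract (reference translation): Large language model (LLM) agents learn by interacting with environments, but long-horizon training is fundamentally bottlenecked by sparse and delayed rewards. Existing methods usually address this problem through post-hoc credit assignment or external reward models, which offer only limited guidance at inference time and often separate reward improvement from policy improvement. This paper proposes Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer its next action during inference, and converts the same signal into a step-level internal reward for denser policy optimization during training. A better policy produces better guidance, and better guidance, used as internal reward, further improves the policy. Across three agent benchmarks, inference-time self-guidance already yields clear gains, and jointly evolving the policy and the internal reward with GRPO brings a further improvement (8%) over baselines trained with environment reward alone. These results suggest that language agents can improve not only by collecting more experience, but also by generating and refining their own internal reward while acting and learning.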
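
The following is a minimal sketch of how a step-level internal reward could densify a sparse environment reward before a GRPO-style group-relative advantage is computed. It is not the authors' implementation: the abstract does not specify how the self-guidance signal is converted into a reward, so the `blended_return` function, the weight `beta`, and the numeric step scores are all hypothetical. Only the group standardization (subtract the group mean, divide by the group standard deviation) reflects the commonly described GRPO advantage, and the population standard deviation is used here as a simplifying choice.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style advantage: standardize each return within its sampling group."""
    mu = mean(returns)
    sigma = pstdev(returns)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in returns]


def blended_return(env_reward, step_rewards, beta=0.5):
    """Hypothetical blend (not from the paper): sparse environment reward
    plus a weighted mean of step-level internal rewards."""
    internal = mean(step_rewards) if step_rewards else 0.0
    return env_reward + beta * internal


# Toy group of three trajectories sampled for the same task.
# Each entry: (final environment reward, hypothetical per-step internal rewards).
group = [
    (1.0, [0.5, 1.0, 1.0]),   # succeeds
    (0.0, [0.5, 0.0, -0.5]),  # fails, weak intermediate steps
    (0.0, [1.0, 0.5, 0.0]),   # fails, but with better intermediate steps
]

env_only = group_relative_advantages([env for env, _ in group])
returns = [blended_return(env, steps) for env, steps in group]
advantages = group_relative_advantages(returns)

print("env-only advantages:", env_only)
print("blended advantages: ", advantages)
```

With environment reward alone, the two failed trajectories receive the same advantage. Under the assumed blend, the failed trajectory with better intermediate steps is ranked above the other, which is the kind of denser signal the abstract attributes to step-level internal reward.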

Related papers