Fugu-MT Paper Translation (Summary): Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

Paper summary: Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

arxiv url: http://arxiv.org/abs/2605.20061v1
Date: Tue, 19 May 2026 16:19:29 GMT
Status: Translation complete
System update date: 2026-05-20 15:03:09.515196
Title: Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents
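Title (reference translation): Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents
Authors: Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou
Abstract summary: Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. This paper proposes ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states. ReBel is evaluated on challenging long-horizon benchmarks, including ALFWorld and WebShop.
Reference score (independently computed attention score): 5.917866758929418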
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.
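Abstract (reference translation): Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. In partially observable environments, however, incomplete observations cause an agent's beliefs to drift over time, while delayed rewards obscure the causal effect of intermediate decisions, which makes temporal credit assignment harder. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize the interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, which converts discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also uses belief-aware grouping to compare trajectories under similar belief states, giving more robust, lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.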
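
The two mechanisms named in the abstract, belief-consistency supervision and belief-aware grouping, can be illustrated with a small sketch. This is not the authors' implementation (see their repository for that). It assumes that a belief is a flat dictionary of facts, that the consistency signal is the fraction of observed facts the belief predicted correctly, and that "similar belief states" means an identical hashable key. The names `consistency_reward` and `belief_grouped_advantages` are hypothetical. The within-group mean and standard deviation normalization follows the general GRPO pattern.

```python
from collections import defaultdict
from statistics import mean, pstdev


def consistency_reward(predicted_belief: dict, observed_feedback: dict) -> float:
    """Hypothetical dense signal: fraction of observed facts the belief got right."""
    if not observed_feedback:
        return 0.0
    matches = sum(
        1 for key, value in observed_feedback.items()
        if predicted_belief.get(key) == value
    )
    return matches / len(observed_feedback)


def belief_grouped_advantages(steps, eps=1e-8):
    """Normalize rewards within groups of steps that share a belief key.

    steps: list of (belief_key, reward) pairs; belief_key is an assumed
    hashable stand-in for "similar belief state".
    Returns one advantage per step, in input order.
    """
    groups = defaultdict(list)
    for index, (belief_key, _reward) in enumerate(steps):
        groups[belief_key].append(index)

    advantages = [0.0] * len(steps)
    for indices in groups.values():
        rewards = [steps[i][1] for i in indices]
        group_mean = mean(rewards)
        group_std = pstdev(rewards)  # 0.0 for a single-element group
        for i in indices:
            advantages[i] = (steps[i][1] - group_mean) / (group_std + eps)
    return advantages


if __name__ == "__main__":
    belief = {"apple": "fridge", "mug": "table"}
    feedback = {"apple": "fridge", "mug": "sink"}
    print(consistency_reward(belief, feedback))
    print(belief_grouped_advantages([("a", 1.0), ("a", 0.0), ("b", 1.0)]))
```

Comparing a step only against other steps with the same belief key is one simple way to keep the baseline from mixing situations in which the agent knew different things. A real system would likely need a softer similarity measure than exact key equality.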

