Fugu-MT paper translation (abstract): StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Paper overview: StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.18401v1
Date: Mon, 20 Apr 2026 15:22:39 GMT
Status: translation complete
Last updated in system: 2026-04-21 21:52:52.969502
Title: StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
Authors: Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen
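Abstract summary: General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. Agentic Reinforcement Learning (RL) has emerged as a post-training paradigm for strengthening large language models. The conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation.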
Reference score (independently computed attention measure): 44.2992619825834
License: http://creativecommons.org/licenses/by/4.0/
Abstract: General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly strong agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key system designs required to realize step-level Agentic RL in practice, and we report preliminary experiments that provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.
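The abstract argues for treating a whole step, rather than a single token, as the agent's action and for assigning credit at that granularity, but it does not say how the credit is computed. The following is only a minimal sketch of one possible reading, assuming a discounted return-to-go that is discounted once per step and then shared by every token in that step. The `Step` container, the function names, and the `gamma` parameter are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent decision, e.g. a full tool call or reply (hypothetical container)."""
    token_ids: List[int]  # tokens the policy generated for this step
    reward: float         # reward observed after the step (often 0.0 until the end)


def step_returns(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Return-to-go per step, discounted once per step (assumed scheme)."""
    returns = [0.0] * len(steps)
    running = 0.0
    # Walk backwards so each step accumulates the rewards that follow it.
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


def per_token_credit(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Broadcast each step's return to all tokens inside that step."""
    credits: List[float] = []
    for step, ret in zip(steps, step_returns(steps, gamma)):
        credits.extend([ret] * len(step.token_ids))
    return credits


# Toy multi-turn trajectory with a sparse reward on the final step only.
trajectory = [
    Step(token_ids=[11, 12, 13], reward=0.0),
    Step(token_ids=[21, 22], reward=0.0),
    Step(token_ids=[31, 32, 33, 34], reward=1.0),
]

print(step_returns(trajectory, gamma=0.5))
print(per_token_credit(trajectory, gamma=0.5))
```

In this sketch the discount depends on how many decisions separate a step from the reward, not on how many tokens each decision happened to contain, so a long tool call and a short one at the same position receive the same credit per token.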

Related papers