Fugu-MT Paper Translation (Summary): Policy Improvement Reinforcement Learning

Paper summary: Policy Improvement Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.00860v1
Date: Wed, 01 Apr 2026 13:10:20 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.999406
Title: Policy Improvement Reinforcement Learning
Title (reference translation): Policy Improvement Reinforcement Learning
Authors: Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban
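Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without verifying whether the resulting update actually improved the model. The authors argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly.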
Reference score (independently computed attention score): 40.05196753615896
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones -- transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.
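The abstract's claim that maximizing cumulative policy improvement is aligned with maximizing final task performance can be illustrated with a telescoping sum. This is only a sketch under assumed notation, not the paper's proof: here J(\pi) denotes expected task reward under policy \pi, \pi_t is the policy after iteration t, and per-iteration improvement is assumed to be the difference J(\pi_t) - J(\pi_{t-1}). The paper's actual definitions may differ.

```latex
\[
\sum_{t=1}^{T} \bigl( J(\pi_t) - J(\pi_{t-1}) \bigr) = J(\pi_T) - J(\pi_0)
\]
```

Because J(\pi_0) is fixed by the initial model, maximizing the left-hand side under these assumptions is the same as maximizing J(\pi_T), the final performance.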
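Abstract (reference translation): Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Existing methods optimize policies based on instantaneous group-level or batch-level statistics without verifying whether the resulting update actually improved the model. This open-loop design updates in isolation at each step, guided only by within-group (batch) reward signals. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce PIRL, a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove that this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO). At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses harmful ones. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.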
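The retrospective verification step described in the abstract (comparing the latest result against a sliding-window historical baseline, then reinforcing or suppressing the previous update) might be sketched as follows. This is not the authors' PIPO implementation: the class `SlidingWindowBaseline`, the function `verdict`, the window size, the use of mean batch reward as the compared quantity, and the three action labels are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean


class SlidingWindowBaseline:
    """Hypothetical helper: keeps the last `window` mean rewards as a baseline."""

    def __init__(self, window: int = 4) -> None:
        # deque with maxlen drops the oldest entry automatically
        self.history = deque(maxlen=window)

    def improvement(self, mean_reward: float) -> float:
        """Return mean_reward minus the window average, then record mean_reward.

        With no history yet there is nothing to compare against, so return 0.0.
        """
        signal = mean_reward - mean(self.history) if self.history else 0.0
        self.history.append(mean_reward)
        return signal


def verdict(signal: float) -> str:
    """Map the signal to an action on the previous update (illustrative labels)."""
    if signal > 0:
        return "reinforce"
    if signal < 0:
        return "suppress"
    return "neutral"


# Usage: feed in the mean reward observed after each (assumed) policy update.
baseline = SlidingWindowBaseline(window=2)
for step, reward in enumerate([1.0, 2.0, 0.0, 3.0]):
    signal = baseline.improvement(reward)
    print(step, signal, verdict(signal))
```

In this sketch the sign of the signal decides whether the previous update is treated as beneficial or harmful; how PIPO actually turns that judgment into a gradient update is not specified in the abstract and is left out here.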

Related papers