Fugu-MT Paper Translation (Abstract): On Advantage Estimates for Max@K Policy Gradients

Paper abstract: On Advantage Estimates for Max@K Policy Gradients

arxiv url: http://arxiv.org/abs/2606.06080v1
Date: Thu, 04 Jun 2026 12:16:39 GMT
Status: Translation complete
System update date: 2026-06-05 22:39:44.781492
Title: On Advantage Estimates for Max@K Policy Gradients
Authors: Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Gouki Minegishi, Yusuke Iwasawa, Takeshi Kojima, Yutaka Matsuo,
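Abstract summary: We introduce a Leave-Two-Out (L2O) baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.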
Reference score (independently computed attention score): 38.07689739365912
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
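Note: the max@K objective named in the abstract is the expected best reward among K sampled responses. The sketch below shows one standard way to estimate it from a group of n scalar rewards by averaging the maximum over all size-K subsets, using order statistics instead of enumeration. The function names `max_at_k_estimate` and `max_at_k_brute_force` are hypothetical, and this is not the paper's MaxPO advantage or its L2O baseline, neither of which the abstract specifies.

```python
from itertools import combinations
from math import comb


def max_at_k_estimate(rewards, k):
    """Mean of max(reward) over all size-k subsets of the group.

    Assumes rewards are i.i.d. samples from one policy; under that
    assumption this is an unbiased estimate of the expected max of k draws.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    ordered = sorted(rewards)
    # The element at 0-based sorted position i is the maximum of exactly
    # C(i, k-1) subsets: its k-1 companions come from the i smaller slots.
    weighted = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return weighted / comb(n, k)


def max_at_k_brute_force(rewards, k):
    """Reference implementation that enumerates every size-k subset."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


if __name__ == "__main__":
    group = [0.0, 1.0, 0.0, 0.5]
    print(max_at_k_estimate(group, 2), max_at_k_brute_force(group, 2))
```

With binary rewards this reduces to the familiar pass@K estimate 1 - C(n-c, K)/C(n, K), where c is the number of correct samples.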

Related papers