Fugu-MT Paper Translation (Abstract): Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Paper overview: Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

arxiv url: http://arxiv.org/abs/2605.06139v1
Date: Thu, 07 May 2026 12:38:17 GMT
Status: translation complete
System update date: 2026-05-08 22:27:11.777485
Title: Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Title (reference translation): Listwise Policy Optimization: Group-Based RLVR as Target Projection on the LLM Response Simplex
Authors: Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji
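Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for post-training large language models (LLMs) to incentivize reasoning capacity. This work reveals that group-based policy gradient strategies share a common geometric structure. The authors propose Listwise Policy Optimization (LPO) to conduct the target projection explicitly: it demystifies the implicit target by restricting the proximal RL objective to the response simplex, then projects the policy via exact divergence minimization.
Reference score (independently computed attention score): 43.502315311491635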
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
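Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for post-training large language models (LLMs) to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent: it samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure. Building on this insight, we propose Listwise Policy Optimization (LPO) to conduct the target projection explicitly; it demystifies the implicit target by restricting the proximal RL objective to the response simplex, then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection, with distinct structural properties, through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.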
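
The abstract gives no formulas, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a target of the form q_i ∝ p_i · exp(A_i / beta) over one sampled response group, and a KL(q ‖ p) projection; both choices, along with the names `listwise_target`, `projection_gradient`, and `beta`, are hypothetical. Under these assumptions, the gradient with respect to the group logits is simply p − q, which is bounded, sums to zero, and vanishes once the policy matches the target.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's response group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def softmax(scores):
    """Numerically stable softmax: maps scores to a point on the simplex."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_target(logps, advantages, beta=1.0):
    """ASSUMED target (not taken from the paper): q_i ∝ p_i * exp(A_i / beta),
    where p is the policy restricted to the sampled group."""
    return softmax([lp + a / beta for lp, a in zip(logps, advantages)])

def projection_gradient(logps, target):
    """Gradient of KL(target || p) w.r.t. the group logits, with
    p = softmax(logps) and the target held fixed. Equals p - target."""
    p = softmax(logps)
    return [pi - qi for pi, qi in zip(p, target)]

# Hypothetical group of four responses: binary verifiable rewards and
# made-up sequence log-probabilities under the current policy.
rewards = [1.0, 0.0, 0.0, 1.0]
logps = [-2.0, -1.0, -1.5, -3.0]

adv = group_relative_advantages(rewards)
q = listwise_target(logps, adv, beta=1.0)
grad = projection_gradient(logps, q)
```

A gradient-descent step on the logits moves along `-grad`, so responses whose target mass exceeds their current mass (here, the rewarded ones) are pushed up and the rest are pushed down by the same total amount.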


Related papers