Fugu-MT paper translation (abstract): LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Paper abstract: LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

arxiv url: http://arxiv.org/abs/2605.19416v2
Date: Fri, 22 May 2026 09:30:26 GMT
Status: Translation complete
Last updated in system: 2026-05-25 14:44:53.694782
Title: LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
Title (reference translation): LambdaPO: Lambda-Style Policy Optimization for Reasoning in Language Models
Authors: Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng, Zhiqian Chen, Liang Zhao
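Abstract summary: Group Relative Policy Optimization is valued for forgoing an explicit value critic. Its reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar. We introduce Lambda Policy Optimization (LambdaPO), a new framework that addresses this information-theoretic bottleneck.
Reference score (independently computed attention score): 34.349722314481824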
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this information-theoretic bottleneck, we introduce a novel framework, Lambda Policy Optimization (LambdaPO), that re-conceptualizes advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optimum. Experimental results across challenging math reasoning and question-answering tasks demonstrate that LambdaPO improves performance compared to the baseline methods.
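Abstract (reference translation): Group Relative Policy Optimization (GRPO) has become a foundation of modern reinforcement learning alignment, valued because it dispenses with an explicit value critic by normalizing rewards across cohorts of sampled trajectories. However, its dependence on a monolithic statistical baseline such as the group mean reduces the relational topology of the trajectory space to a single scalar and erases the fine-grained preference information needed to navigate complex, rank-sensitive reward landscapes. To address this information-theoretic bottleneck, the authors introduce a new framework, Lambda Policy Optimization (LambdaPO), which recasts advantage estimation from a scalar value into a decomposed, pairwise preference structure. Specifically, the advantage of any trajectory is formulated as the integrated sum of its reward differences against all peers in its cohort, with each pairwise comparison dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, the objective is augmented with a semantic density reward derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, the method can extract finer-grained optimization signals from a group of rollouts and guide the LLM to a better optimum. Experiments on challenging math reasoning and question-answering tasks show that LambdaPO improves performance over the baseline methods.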
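The abstract describes two mechanisms in words only: a pairwise advantage that sums reward differences against every peer in the group, each attenuated by the policy's confidence in that preference, and a precision-recall based density reward. The Python sketch below is one possible reading of that description, not the authors' implementation. Several pieces are assumptions that do not appear in the abstract: the attenuation is taken to be a LambdaRank-style weight sigmoid(-beta * margin), the policy "score" is assumed to be something like a length-normalized sequence log-probability, the temperature `beta` is hypothetical, and the density reward is approximated by a simple token-overlap F1 rather than whatever semantic alignment the paper uses. The group-mean baseline omits GRPO's standard-deviation normalization for brevity.

```python
import math
from collections import Counter


def group_mean_advantages(rewards):
    """Mean-centered baseline (GRPO-style, without std normalization)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def pairwise_advantages(rewards, scores, beta=1.0):
    """Assumed pairwise advantage: A_i = mean_j w_ij * (r_i - r_j).

    w_ij = sigmoid(-beta * margin), where margin is the policy score gap
    measured in the direction of the reward preference. A pair the policy
    already orders correctly with high confidence gets a small weight.
    """
    n = len(rewards)
    advantages = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            gap = rewards[i] - rewards[j]
            if gap == 0.0:
                continue  # ties (and i == j) carry no preference signal
            sign = 1.0 if gap > 0 else -1.0
            margin = beta * sign * (scores[i] - scores[j])
            # Numerically stable form of sigmoid(-margin).
            weight = 0.5 * (1.0 - math.tanh(0.5 * margin))
            total += weight * gap
        advantages.append(total / max(n - 1, 1))
    return advantages


def overlap_f1(generated, reference):
    """Token-overlap F1, a stand-in for the precision-recall density reward."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative group of four rollouts: two correct, two incorrect.
rewards = [1.0, 1.0, 0.0, 0.0]
scores = [-0.2, -1.5, -0.4, -2.0]  # hypothetical policy log-prob scores
print(group_mean_advantages(rewards))
print(pairwise_advantages(rewards, scores))
```

Under these assumptions, the group-mean baseline assigns both correct rollouts the same advantage, while the pairwise version gives a larger advantage to the correct rollout that the policy currently scores lower. When all scores are equal, every weight is 0.5 and the pairwise advantage reduces to a rescaled mean-centered advantage, which matches the abstract's framing of the group mean as a collapsed special case.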
