Fugu-MT paper translation (abstract): Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning

Paper overview: Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning

arxiv url: http://arxiv.org/abs/2602.04380v1
Date: Wed, 04 Feb 2026 10:01:20 GMT
Status: Translation complete
Last updated in system: 2026-02-05 19:45:11.465499
Title: Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning
Title (reference translation): Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning
Authors: Rui Yuan, Mykola Khandoga, Vinay Kumar Sankarapu
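Abstract summary: Group-Based Mirror Policy Optimization (GBMPO) is a framework that extends group-based policy optimization to flexible Bregman divergences. On GSM8K, hand-designed ProbL2-GRPO achieves 86.7% accuracy, a 5.5-point improvement over the Dr. GRPO baseline.
Reference score (independently computed attention score): 3.259050650999544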
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Policy optimization methods like Group Relative Policy Optimization (GRPO) and its variants have achieved strong results on mathematical reasoning and code generation tasks. Despite extensive exploration of reward processing strategies and training dynamics, all existing group-based methods exclusively use KL divergence for policy regularization, leaving the choice of divergence function unexplored. We introduce Group-Based Mirror Policy Optimization (GBMPO), a framework that extends group-based policy optimization to flexible Bregman divergences, including hand-designed alternatives (L2 in probability space) and learned neural mirror maps. On GSM8K mathematical reasoning, hand-designed ProbL2-GRPO achieves 86.7% accuracy, improving +5.5 points over the Dr. GRPO baseline. On MBPP code generation, neural mirror maps reach 60.1-60.8% pass@1, with random initialization already capturing most of the benefit. While evolutionary strategies meta-learning provides marginal accuracy improvements, its primary value lies in variance reduction ($\pm$0.2 versus $\pm$0.6) and efficiency gains (15% shorter responses on MBPP), suggesting that random initialization of neural mirror maps is sufficient for most practical applications. These results establish divergence choice as a critical, previously unexplored design dimension in group-based policy optimization for LLM reasoning.
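Abstract (reference translation): Policy optimization methods such as Group Relative Policy Optimization (GRPO) and its variants have achieved strong results on mathematical reasoning and code generation tasks. Despite extensive exploration of reward processing strategies and training dynamics, all existing group-based methods use only KL divergence for policy regularization, and the choice of divergence function remains unexplored. We introduce GBMPO (Group-Based Mirror Policy Optimization), a framework that extends group-based policy optimization to flexible Bregman divergences. On GSM8K mathematical reasoning, hand-designed ProbL2-GRPO achieves 86.7% accuracy, a +5.5 point improvement over the Dr. GRPO baseline. On MBPP code generation, neural mirror maps reach 60.1-60.8% pass@1. Meta-learning with evolutionary strategies yields marginal accuracy improvements, but its main value lies in variance reduction ($\pm$0.2 vs $\pm$0.6) and efficiency gains (15% shorter responses on MBPP), indicating that random initialization of neural mirror maps is sufficient for most practical applications. These results establish divergence choice as a critical, previously unexplored design dimension in group-based policy optimization for LLM reasoning.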
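The abstract contrasts KL divergence with an L2 divergence in probability space, both of which are instances of a Bregman divergence. As a rough illustration only, the sketch below uses the textbook definition D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q> and checks that the negative-entropy generator recovers KL while the half-squared-norm generator recovers half the squared Euclidean distance. The paper's actual ProbL2-GRPO objective and its neural mirror maps are not specified in this abstract, so every function name here is hypothetical and the example distributions are made up.

```python
import math

def bregman(phi, grad_phi, p, q):
    """Textbook Bregman divergence: phi(p) - phi(q) - <grad phi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner

# Generator 1: negative entropy. On the probability simplex this yields KL(p || q).
def neg_entropy(p):
    return sum(x * math.log(x) for x in p)

def neg_entropy_grad(p):
    return [math.log(x) + 1.0 for x in p]

# Generator 2: half squared norm. This yields 0.5 * ||p - q||^2,
# i.e. an L2 divergence taken directly on the probabilities.
def half_sq_norm(p):
    return 0.5 * sum(x * x for x in p)

def half_sq_norm_grad(p):
    return list(p)

# Illustrative distributions (assumed values, not from the paper).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

kl_bregman = bregman(neg_entropy, neg_entropy_grad, p, q)
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

l2_bregman = bregman(half_sq_norm, half_sq_norm_grad, p, q)
l2_direct = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))

print(f"KL via Bregman: {kl_bregman:.6f}, direct: {kl_direct:.6f}")
print(f"L2 via Bregman: {l2_bregman:.6f}, direct: {l2_direct:.6f}")
```

Swapping the generator function is the only change needed to move between the two divergences, which is one way to read the abstract's claim that divergence choice is a separate design dimension.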

Related papers