Fugu-MT Paper Translation (Abstract): RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training

Paper overview: RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training

arxiv url: http://arxiv.org/abs/2510.00911v1
Date: Wed, 01 Oct 2025 13:53:09 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.595829
Title: RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training
Title (reference translation): RiskPO: Risk-Based Policy Optimization with Verifiable Rewards for LLM Post-Training
Authors: Tao Ren, Jinyang Jiang, Hui Yang, Wan Tian, Minhao Zou, Guanghao Li, Zishi Zhang, Qinghao Wang, Shentao Qin, Yanjun Zhao, Rui Tao, Hui Shao, Yijie Peng
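Abstract summary: Reinforcement learning with verifiable reward has emerged as a central paradigm for post-training large language models (LLMs). The authors argue that the entropy collapse and limited reasoning gains of prevailing mean-based methods stem from overemphasizing high-probability output sequences while neglecting rare but informative reasoning paths. The paper proposes Risk-based Policy Optimization (RiskPO), which replaces classical mean-based objectives with principled risk measures.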
Reference score (independently computed attention score): 13.309653291779233
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs); however, prevailing mean-based methods, such as Group Relative Policy Optimization (GRPO), suffer from entropy collapse and limited reasoning gains. We argue that these issues stem from overemphasizing high-probability output sequences while neglecting rare but informative reasoning paths. To address these challenges, we propose Risk-based Policy Optimization (RiskPO), which substitutes classical mean-based objectives with principled risk measures. Specifically, we introduce a Mixed Value-at-Risk objective that integrates weighted attention over multiple regions of the reward distribution, thereby amplifying gradient signals on challenging instances and preventing overconfident convergence. We further design a bundling scheme that aggregates multiple questions into bundles, thus enriching the feedback signal and yielding more stable and informative training dynamics. Theoretically, we prove that the risk-averse update alleviates entropy collapse and promotes exploration. Numerically, RiskPO achieves consistent and significant improvements in mathematical reasoning, multi-modal reasoning, and code generation benchmarks, surpassing GRPO and its variants on both Pass@1 and Pass@k metrics. Our results demonstrate that risk-based optimization provides a rigorous and effective paradigm for enhancing LLM reasoning capabilities.
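Abstract (reference translation): Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs), but mean-based methods such as Group Relative Policy Optimization (GRPO) suffer from entropy collapse and limited reasoning gains. We argue that these issues stem from overemphasizing high-probability output sequences while neglecting rare but informative reasoning paths. To address these challenges, we propose Risk-based Policy Optimization (RiskPO), which replaces classical mean-based objectives with principled risk measures. Specifically, we introduce a Mixed Value-at-Risk objective that integrates weighted attention over multiple regions of the reward distribution, amplifying gradient signals on challenging instances and preventing overconfident convergence. We further design a bundling scheme that aggregates multiple questions into bundles, enriching the feedback signal and yielding more stable and informative training dynamics. Theoretically, the risk-averse update alleviates entropy collapse and promotes exploration. RiskPO achieves consistent and significant improvements on mathematical reasoning, multi-modal reasoning, and code generation benchmarks. These results show that risk-based optimization provides a rigorous and effective paradigm for enhancing LLM reasoning capabilities.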
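
The abstract describes two mechanisms, bundling several questions together and weighting different regions of the reward distribution, without giving formulas. As a rough illustration only, the Python sketch below computes summed bundle scores, an empirical Value-at-Risk (a quantile), and a weighted combination of a lower-tail mean and a middle-region mean. The function names, the parameters `alpha`, `beta`, and `weight`, and the exact form of the combination are assumptions made for this example; they are not taken from the paper and should not be read as its Mixed Value-at-Risk objective.

```python
import math
from typing import List, Sequence


def bundle_scores(rewards: Sequence[float], bundle_size: int) -> List[float]:
    """Sum per-question rewards over consecutive bundles (hypothetical helper)."""
    if bundle_size <= 0 or len(rewards) % bundle_size != 0:
        raise ValueError("rewards must split evenly into bundles")
    return [
        sum(rewards[i:i + bundle_size])
        for i in range(0, len(rewards), bundle_size)
    ]


def empirical_var(samples: Sequence[float], level: float) -> float:
    """Empirical quantile: smallest sample with at least `level` of the mass at or below it."""
    if not samples or not 0.0 < level <= 1.0:
        raise ValueError("need samples and a level in (0, 1]")
    ordered = sorted(samples)
    return ordered[math.ceil(level * len(ordered)) - 1]


def region_mean(samples: Sequence[float], lower: float, upper: float) -> float:
    """Mean of the sorted samples whose rank fraction lies in (lower, upper]."""
    ordered = sorted(samples)
    n = len(ordered)
    region = ordered[math.floor(lower * n):math.ceil(upper * n)]
    if not region:
        raise ValueError("quantile region contains no samples")
    return sum(region) / len(region)


def mixed_region_objective(
    samples: Sequence[float], alpha: float, beta: float, weight: float
) -> float:
    """Assumed form: lower-tail mean plus `weight` times the (alpha, beta] region mean."""
    return region_mean(samples, 0.0, alpha) + weight * region_mean(samples, alpha, beta)


# Sixteen binary (verifiable) rewards grouped into four bundles of four questions.
rewards = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
scores = bundle_scores(rewards, bundle_size=4)
median_var = empirical_var(scores, level=0.5)
objective = mixed_region_objective(scores, alpha=0.5, beta=1.0, weight=0.5)
```

In this toy setup a bundle score takes values between 0 and 4 rather than just 0 or 1, which is one way to read the abstract's remark that bundling enriches the feedback signal. A mean-based objective would weight all four bundles equally, whereas the region-based sketch gives the lower-scoring bundles their own term.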

Related papers