Fugu-MT Paper Translation (Abstract): SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning

Paper overview: SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning

arxiv url: http://arxiv.org/abs/2602.01062v1
Date: Sun, 01 Feb 2026 07:13:20 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:33.567834
Title: SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning
Title (reference translation): SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning
Authors: Chenyi Li, Yuan Zhang, Bo Wang, Guoqing Ma, Wei Tang, Haoyang Huang, Nan Duan
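Abstract summary: This paper introduces a set-level diversity objective defined over sampled trajectories using kernelized similarity. The method derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective into policy optimization as a plug-in advantage-shaping term. Experiments across a range of model scales show the effectiveness of the proposed algorithm, which consistently outperforms strong baselines in both Pass@1 and Pass@K across various benchmarks.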
Reference score (independently computed attention metric): 50.93295951454092
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards has shown notable effectiveness in enhancing the reasoning performance of large language models (LLMs), especially in mathematics tasks. However, such improvements often come with reduced outcome diversity, where the model concentrates probability mass on a narrow set of solutions. Motivated by diminishing-returns principles, we introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage-shaping term for policy optimization. We further investigate the contribution of a single trajectory to language model diversity within a distribution perturbation framework. This analysis theoretically confirms a monotonicity property, proving that rarer trajectories yield consistently higher marginal contributions to the global diversity. Extensive experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
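Abstract (reference translation): Reinforcement learning with verifiable rewards has been notably effective at improving the reasoning performance of large language models (LLMs), particularly on mathematics tasks. However, these gains often reduce outcome diversity, with the model concentrating probability mass on a narrow set of solutions. Motivated by the principle of diminishing returns, the paper introduces a set-level diversity objective defined over sampled trajectories using kernelized similarity. The method derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective into policy optimization as a plug-in advantage-shaping term. It further examines the contribution of a single trajectory to language model diversity within a distribution perturbation framework. This analysis theoretically confirms a monotonicity property, proving that rarer trajectories consistently yield higher marginal contributions to global diversity. Extensive experiments across a range of model scales demonstrate the effectiveness of the proposed algorithm, which consistently outperforms strong baselines in both Pass@1 and Pass@K across various benchmarks.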
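
Illustrative sketch (not from the paper): the abstract describes a kernel-based set-level diversity objective, a leave-one-out marginal contribution per sampled trajectory, and an advantage-shaping term, but gives no formulas. The Python below is one possible reading under stated assumptions. The log-determinant objective log det(I + K), the RBF kernel, the group-normalized base advantage, the coefficient `beta`, and all function names are hypothetical choices made for illustration; the trajectory embeddings are assumed to come from some external encoder.

```python
import numpy as np


def rbf_kernel(x: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Pairwise RBF similarity between the rows of x (shape [n, d])."""
    sq = np.sum(x * x, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    return np.exp(-gamma * d2)


def set_diversity(k: np.ndarray) -> float:
    """Assumed set-level objective: log det(I + K).

    This is a standard diminishing-returns (submodular) set function;
    the paper's actual objective may differ.
    """
    n = k.shape[0]
    if n == 0:
        return 0.0
    _, logdet = np.linalg.slogdet(np.eye(n) + k)
    return float(logdet)


def leave_one_out_marginals(k: np.ndarray) -> np.ndarray:
    """Marginal contribution of trajectory i: F(S) - F(S without i)."""
    n = k.shape[0]
    full = set_diversity(k)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        out[i] = full - set_diversity(k[np.ix_(keep, keep)])
    return out


def shaped_advantages(rewards, embeddings, beta: float = 0.1, gamma: float = 1.0) -> np.ndarray:
    """Base advantage plus a diversity bonus (hypothetical shaping rule).

    The base term is a group-normalized reward, assumed here for
    illustration; beta scales the leave-one-out diversity bonus.
    """
    rewards = np.asarray(rewards, dtype=float)
    base = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    kernel = rbf_kernel(np.asarray(embeddings, dtype=float), gamma)
    return base + beta * leave_one_out_marginals(kernel)


# Usage: three near-identical trajectories and one distinct trajectory,
# all with the same verifiable reward.
embeddings = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [3.0, 3.0]])
advantages = shaped_advantages([1.0, 1.0, 1.0, 1.0], embeddings)
```

In this toy setting the distinct trajectory receives a larger shaped advantage than the three duplicates, which is the qualitative behavior the abstract's monotonicity claim describes (rarer trajectories contribute more). That the same holds for the paper's own objective is stated by the abstract, not demonstrated by this sketch.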

Related papers