Fugu-MT Paper Translation (Abstract): RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

Paper summary: RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

arxiv url: http://arxiv.org/abs/2606.01281v1
Date: Sun, 31 May 2026 15:06:38 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.494374
Title: RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
Title (reference translation): RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
Authors: Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji
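Abstract summary: Group Prioritized Off-Policy Optimization (POPO) is a framework that exploits effective training batches without incurring additional rollout overhead. POPO comprises two key components: prioritized group replay and decoupled off-policy optimization. POPO substantially accelerates RL finetuning and achieves strong reasoning performance with significantly fewer rollouts.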
Reference score (independently computed attention score): 49.04912820721943
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness is substantially hindered by the prevalence of ineffective training data: many sampled prompts yield response groups that are either entirely correct or entirely incorrect, resulting in zero-variance rewards and limited learning signals. Recent state-of-the-art methods address this issue through extensive LLM rollouts to filter ineffective samples, but at the cost of considerable computational overhead. Alternative approaches, including predictive sampling and trajectory replay, aim to improve data efficiency but often remain insufficient and may introduce additional issues such as systematic bias or suboptimal constraints. To address these limitations, we propose Group Prioritized Off-Policy Optimization (POPO), a simple yet effective framework that fully exploits effective training batches without additional rollout overhead. POPO comprises two key components: prioritized group replay and decoupled off-policy optimization. The former replaces ineffective on-policy groups with effective off-policy groups via a recency-based replay mechanism that jointly considers sample quality and the degree of off-policiness. To further mitigate the off-policy gap, POPO employs decoupled importance sampling to correct off-policy bias while maintaining stable policy updates under consistent trust-region constraints. Empirical evaluations across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that POPO substantially accelerates RL finetuning and achieves strong reasoning performance with significantly fewer rollouts.
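Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for improving the reasoning capabilities of large language models (LLMs). Its effectiveness, however, is substantially hindered by the prevalence of ineffective training data: many sampled prompts produce response groups that are either entirely correct or entirely incorrect, which yields zero-variance rewards and a limited learning signal. Recent state-of-the-art methods filter out ineffective samples through extensive LLM rollouts, but at considerable computational cost. Other approaches, including predictive sampling and trajectory replay, aim to improve data efficiency but are often insufficient and can introduce further problems such as systematic bias or suboptimal constraints. To address these limitations, the authors propose Group Prioritized Off-Policy Optimization (POPO), a simple yet effective framework that fully exploits effective training batches without additional rollout overhead. POPO comprises two key components: prioritized group replay and decoupled off-policy optimization. The former replaces ineffective on-policy groups with effective off-policy groups through a recency-based replay mechanism that jointly considers sample quality and the degree of off-policiness. To further reduce the off-policy gap, POPO uses decoupled importance sampling to correct off-policy bias while keeping policy updates stable under consistent trust-region constraints. Empirical evaluations across diverse reasoning tasks, including mathematics, planning, and visual geometry, show that POPO substantially accelerates RL finetuning and achieves strong reasoning performance with significantly fewer rollouts.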
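
The abstract describes replacing zero-variance (all-correct or all-incorrect) response groups with effective groups drawn from a replay buffer. The following is a minimal illustrative sketch of that idea only, not the authors' implementation. The `Group` class, the `priority` formula (reward variance discounted by age), the `decay` parameter, and `fill_batch` are all hypothetical names and choices introduced here; the abstract does not specify how sample quality and off-policiness are combined.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import pvariance


@dataclass
class Group:
    prompt: str
    rewards: list[float]  # one verifiable reward per sampled response
    step: int             # training step at which the group was sampled


def is_effective(group: Group) -> bool:
    """A group gives a learning signal only if its rewards are not all equal."""
    return pvariance(group.rewards) > 0.0


def priority(group: Group, current_step: int, decay: float = 0.9) -> float:
    """Assumed score: reward variance (quality) discounted by staleness."""
    age = current_step - group.step
    return pvariance(group.rewards) * decay ** age


def fill_batch(on_policy: list[Group], buffer: list[Group],
               current_step: int) -> list[Group]:
    """Keep effective on-policy groups; fill the remaining slots from the buffer."""
    effective = [g for g in on_policy if is_effective(g)]
    n_missing = len(on_policy) - len(effective)
    candidates = [g for g in buffer if is_effective(g)]
    ranked = sorted(candidates,
                    key=lambda g: priority(g, current_step),
                    reverse=True)
    return effective + ranked[:n_missing]
```

In this sketch the batch size stays constant without extra rollouts: each ineffective on-policy group is swapped for the highest-priority buffered group. The replayed groups are off-policy, so an actual training loop would still need an importance-sampling correction, which is not shown here.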

Related papers