Fugu-MT Paper Translation (Summary): Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

Paper summary: Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

arxiv url: http://arxiv.org/abs/2605.11461v2
Date: Mon, 18 May 2026 07:36:31 GMT
Status: Translation complete
System update date: 2026-05-19 17:57:45.606737
Title: Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
Title (reference translation): Breaking $\textit{Winner-Takes-All}$: Improving Diverse LLM Reasoning through Cooperative Policy Optimization
Authors: Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu
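Abstract summary: Group Cooperative Policy Optimization (GCPO) shifts the training paradigm from rollout competition to team cooperation. GCPO replaces independent rollout scoring with team-level credit assignment. It redistributes the collective team reward to each rollout according to that rollout's average marginal contribution to the team.
Reference score (independently computed attention score): 53.42577591449649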
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, in which the model prematurely converges on a narrow set of high-scoring patterns and loses the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or a diversity bonus. However, these approaches do not change the $\textit{winner-takes-all}$ nature of training, where rollouts still compete for individual advantage rather than cooperating to maximize global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than by its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to the volume. During advantage estimation, GCPO redistributes the collective team reward to each rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.
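Abstract (reference translation): Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, but popular group-based optimization algorithms such as GRPO often suffer from exploration collapse. Recent efforts try to mitigate this by adding entropy regularization or a diversity bonus. However, these approaches do not change the $\textit{winner-takes-all}$ nature of training, in which rollouts compete for individual advantage rather than cooperating to maximize global diversity. This work proposes Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment, rewarding a rollout by how much it contributes to the team's valid solution coverage. This coverage is described as a determinant volume over reward-weighted semantic embeddings, to which only correct and non-redundant rollouts contribute. During advantage estimation, GCPO redistributes the collective team reward to each rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments on multiple reasoning benchmarks show that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.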
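
The credit assignment described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact volume formula, so the sketch assumes the team value is log det(I + G), where G is the Gram matrix of reward-weighted embeddings, and it computes the average marginal contribution exactly by enumerating every join order (a Shapley-style average, feasible only for small groups). The function names `log_det`, `coverage`, and `average_marginal_contributions` and the toy embeddings are hypothetical.

```python
import itertools
import math


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix (Gaussian elimination)."""
    a = [row[:] for row in matrix]
    total = 0.0
    for k in range(len(a)):
        pivot = a[k][k]  # stays positive for I + Gram, so no row swaps are needed
        total += math.log(pivot)
        for i in range(k + 1, len(a)):
            factor = a[i][k] / pivot
            for j in range(k, len(a)):
                a[i][j] -= factor * a[k][j]
    return total


def coverage(embeddings, rewards, members):
    """ASSUMED team value: log det(I + G) over reward-weighted embeddings of `members`."""
    vecs = [[rewards[i] * x for x in embeddings[i]] for i in members]
    m = len(vecs)
    gram_plus_identity = [
        [sum(p * q for p, q in zip(vecs[r], vecs[c])) + (1.0 if r == c else 0.0)
         for c in range(m)]
        for r in range(m)
    ]
    return log_det(gram_plus_identity)


def average_marginal_contributions(embeddings, rewards):
    """Mean marginal gain in coverage of each rollout over all join orders."""
    n = len(embeddings)
    credit = [0.0] * n
    orders = list(itertools.permutations(range(n)))
    for order in orders:
        team, value = [], 0.0
        for i in order:
            team.append(i)
            new_value = coverage(embeddings, rewards, team)
            credit[i] += new_value - value
            value = new_value
    return [c / len(orders) for c in credit]


# Toy group: rollouts 0 and 1 are correct duplicates, rollout 2 is correct
# and distinct, rollout 3 is incorrect (reward 0).
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
rewards = [1.0, 1.0, 1.0, 0.0]
credit = average_marginal_contributions(embeddings, rewards)
print(credit)
```

Under this assumed value function, the incorrect rollout receives zero credit, the two duplicates split a smaller share equally, and the distinct correct rollout receives the largest share. The credits sum to the value of the full team, which is the sense in which the collective reward is redistributed.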

Related papers