Fugu-MT Paper Translation (Abstract): Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Paper overview: Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

arxiv url: http://arxiv.org/abs/2605.04077v1
Date: Tue, 14 Apr 2026 09:48:46 GMT
Status: Translation complete
Last updated in system: 2026-05-11 06:56:26.575636
Title: Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
Title (reference translation): Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
Authors: Zhiyuan Zeng, Jiameng Huang, Zhangyue Yin, Jiashuo Liu, Ziniu Li, Bingrui Li, Yuhao Wu, Yining Zheng, Ge Zhang, Wenhao Huang, Xipeng Qiu
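Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models. Standard GRPO uses sequence aggregation, while recent work advocates token aggregation as a better alternative. Token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses.
Reference score (independently computed attention score): 70.38763678943648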
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based weights. Experiments with Qwen2.5-Math-7B and Qwen3-1.7B on DAPO-17k and Polaris, evaluated on six reasoning and coding benchmarks, show that BA consistently improves training stability and final performance over standard token and sequence aggregation. Our analysis further shows that the relative effectiveness of token and sequence aggregation is largely governed by response-length variation and the positive-negative length gap, highlighting aggregation as a critical design dimension in GRPO-style RLVR.
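Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, one important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work advocates token aggregation as a better alternative. The two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based weights. Experiments with Qwen2.5-Math-7B and Qwen3-1.7B on DAPO-17k and Polaris, evaluated on six reasoning and coding benchmarks, show that BA consistently improves training stability and final performance over standard token and sequence aggregation. The analysis further shows that the relative effectiveness of token and sequence aggregation is largely governed by response-length variation and the positive-negative length gap, highlighting aggregation as a critical design dimension in GRPO-style RLVR.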
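
The three aggregation rules named in the abstract can be compared on a toy group. The following is a minimal Python sketch, not the authors' implementation. The function names, the representation of each response as a list of per-token policy-gradient terms, the use of the sign of a per-sequence advantage to form the positive and negative subsets, and the choice of subset weights as the fraction of sequences in each subset are all assumptions made for illustration; the abstract only states that BA uses "sequence-count-based weights".

```python
from typing import List, Sequence


def sequence_aggregate(token_terms: Sequence[Sequence[float]]) -> float:
    """Average the per-sequence token means, so every sequence counts equally."""
    return sum(sum(seq) / len(seq) for seq in token_terms) / len(token_terms)


def token_aggregate(token_terms: Sequence[Sequence[float]]) -> float:
    """Average over all tokens in the group, so longer sequences count more."""
    total_tokens = sum(len(seq) for seq in token_terms)
    return sum(sum(seq) for seq in token_terms) / total_tokens


def balanced_aggregate(
    token_terms: Sequence[Sequence[float]],
    advantages: Sequence[float],
) -> float:
    """Token-level mean inside each sign subset, combined by sequence counts.

    Assumption: subset weight = (sequences in subset) / (sequences with a
    non-zero advantage). Zero-advantage sequences are skipped.
    """
    pos: List[Sequence[float]] = [s for s, a in zip(token_terms, advantages) if a > 0]
    neg: List[Sequence[float]] = [s for s, a in zip(token_terms, advantages) if a < 0]
    n = len(pos) + len(neg)
    if n == 0:
        return 0.0
    result = 0.0
    for subset in (pos, neg):
        if subset:
            result += (len(subset) / n) * token_aggregate(subset)
    return result


# Hypothetical group: two positive responses (2 and 4 tokens) and one
# negative response (6 tokens). Values are made-up per-token terms.
group = [[2.0, 2.0], [1.0, 1.0, 1.0, 1.0], [-1.0] * 6]
advs = [1.0, 1.0, -1.0]

print("sequence:", sequence_aggregate(group))
print("token:   ", token_aggregate(group))
print("balanced:", balanced_aggregate(group, advs))
```

In this toy group the long negative response dominates the token-level average, while the sequence-level average gives the short positive response the same weight as the longer ones. The balanced variant pools tokens only within each sign subset, so length differences between the positive and negative subsets do not change their relative weight.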

Related papers