Fugu-MT Paper Translation (Abstract): Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Paper summary: Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

arxiv url: http://arxiv.org/abs/2508.13755v1
Date: Tue, 19 Aug 2025 11:51:40 GMT
Status: Translation complete
Updated in system: 2025-08-20 15:36:31.907637
Title: Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
Authors: Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang
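Abstract summary: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. The paper shows that RLVR's full potential is hindered by two under-explored dimensions: depth, the hardest problem a model can sample, and breadth, the number of instances consumed in a single iteration. It proposes Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through multi-stage rollouts.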
Reference score (independently computed attention score): 52.768671969513164
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth, the hardest problem a model can sample, and Breadth, the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO's mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.
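
Illustrative note (not from the paper's code): the bias toward medium-accuracy problems described in the abstract can be reproduced with a small sketch. It assumes binary 0/1 rewards and the commonly used GRPO group normalization, advantage = (reward - group mean) / group standard deviation, and it takes "cumulative advantage" to mean the sum of absolute advantages over one group of rollouts. Under those assumptions, a group of N rollouts with accuracy p has a cumulative advantage of 2 * N * sqrt(p * (1 - p)), which peaks at p = 0.5 and shrinks toward zero for very hard (or very easy) problems. The function names are hypothetical, and the paper's exact definition may differ.

```python
import math


def grpo_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std, using the population std.

    Assumption: this is the standard GRPO normalization without an epsilon term.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    if var == 0.0:
        # All rollouts received the same reward: the group carries no signal.
        return [0.0] * n
    std = math.sqrt(var)
    return [(r - mean) / std for r in rewards]


def cumulative_abs_advantage(num_correct, group_size):
    """Sum of |advantage| over one group with `num_correct` rewards of 1."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return sum(abs(a) for a in grpo_advantages(rewards))


if __name__ == "__main__":
    group_size = 8  # hypothetical rollout count per problem
    for k in range(group_size + 1):
        p = k / group_size
        closed_form = 2 * group_size * math.sqrt(p * (1 - p))
        total = cumulative_abs_advantage(k, group_size)
        print(f"accuracy={p:.3f}  cumulative={total:.4f}  closed_form={closed_form:.4f}")
```

Running the script prints the cumulative advantage for each accuracy level in a group of 8. A problem solved in 1 of 8 rollouts contributes less total gradient weight than one solved in 4 of 8, which is the kind of down-weighting of hard problems that DARS is described as correcting by adding rollouts for them.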


Related papers