Fugu-MT Paper Translation (Abstract): BroRL: Scaling Reinforcement Learning via Broadened Exploration

Paper overview: BroRL: Scaling Reinforcement Learning via Broadened Exploration

arxiv url: http://arxiv.org/abs/2510.01180v1
Date: Wed, 01 Oct 2025 17:59:02 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.719747
Title: BroRL: Scaling Reinforcement Learning via Broadened Exploration
Title (reference translation): BroRL: Scaling Up Reinforcement Learning via Extended Exploration
Authors: Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Harchaoui, Shizhe Diao, Yejin Choi, Pavlo Molchanov, Jun Yang, Jan Kautz, Yi Dong
Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. The recent ProRL has shown promise in scaling RL by increasing the number of training steps. BroRL is a complementary paradigm for scaling RL that increases the number of rollouts per example to hundreds.
Reference score (independently computed attention score): 88.69554867685243
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work ProRL has shown promise in scaling RL by increasing the number of training steps. However, performance plateaus after thousands of steps, with clear diminishing returns from allocating more computation to additional training. In this work, we investigate a complementary paradigm for scaling RL, BroRL, increasing the number of rollouts per example to hundreds to exhaustively Broaden exploration, which yields continuous performance gains beyond the saturation point observed in ProRL when scaling the number of training steps. Our approach is motivated by a mass balance equation analysis allowing us to characterize the rate of change in probability mass for correct and incorrect tokens during the reinforcement process. We show that under a one-step RL assumption, sampled rollout tokens always contribute to correct-mass expansion, while unsampled tokens outside rollouts may lead to gains or losses depending on their distribution and the net reward balance. Importantly, as the number of rollouts per example N increases, the effect of unsampled terms diminishes, ensuring overall correct-mass expansion. To validate our theoretical analysis, we conduct simulations under more relaxed conditions and find that a sufficiently large rollout size N, corresponding to ample exploration, guarantees an increase in the probability mass of all correct tokens. Empirically, BroRL revives models saturated after 3K ProRL training steps and demonstrates robust, continuous improvement, achieving state-of-the-art results for the 1.5B model across diverse benchmarks.
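Abstract (reference translation): Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. The recent ProRL has shown promise in scaling RL by increasing the number of training steps. However, performance plateaus after thousands of steps, with clearly diminishing returns from allocating more computation to additional training. In this work, we investigate BroRL, a complementary paradigm for scaling RL that increases the number of rollouts per example to hundreds in order to exhaustively broaden exploration. Our approach is motivated by a mass balance equation analysis, which lets us characterize the rate of change in probability mass for correct and incorrect tokens during the reinforcement process. Under a one-step RL assumption, sampled rollout tokens always contribute to correct-mass expansion, whereas unsampled tokens outside the rollouts may lead to gains or losses depending on their distribution and the net reward balance. Importantly, as the number of rollouts per example N increases, the effect of the unsampled terms diminishes, ensuring overall correct-mass expansion. To validate the theoretical analysis, we run simulations under more relaxed conditions and show that a sufficiently large rollout size N increases the probability mass of all correct tokens. BroRL revives models saturated after 3K ProRL training steps, shows robust and continuous improvement, and achieves state-of-the-art results for the 1.5B model across diverse benchmarks.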
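
The claim that a larger rollout count N makes correct-mass expansion more reliable can be illustrated with a toy experiment. The sketch below is not the paper's simulation or its mass balance equation. It assumes a single-step softmax policy over a small vocabulary, treats each "rollout" as one sampled token, uses a reward of +1 for correct tokens and -1 for incorrect ones, and applies a plain REINFORCE update without a baseline. All function names, the vocabulary size, and the learning rate are hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def correct_mass(logits, correct):
    """Total probability the policy assigns to the correct tokens."""
    probs = softmax(logits)
    return sum(probs[i] for i in correct)


def reinforce_step(logits, correct, n_rollouts, lr, rng):
    """One REINFORCE update estimated from n_rollouts sampled tokens."""
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=n_rollouts)
    grad = [0.0] * len(logits)
    for a in samples:
        r = 1.0 if a in correct else -1.0  # assumed verifiable reward
        # Gradient of log p(a) w.r.t. the logits is onehot(a) - probs.
        for j in range(len(logits)):
            onehot = 1.0 if j == a else 0.0
            grad[j] += r * (onehot - probs[j]) / n_rollouts
    return [z + lr * g for z, g in zip(logits, grad)]


def expected_step(logits, correct, lr):
    """The same update in the infinite-rollout limit (exact expectation)."""
    probs = softmax(logits)
    rewards = [1.0 if j in correct else -1.0 for j in range(len(logits))]
    mean_reward = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - mean_reward)
            for z, p, r in zip(logits, probs, rewards)]


def fraction_improved(n_rollouts, trials=200, vocab=50, lr=0.5, seed=0):
    """Fraction of random policies whose correct mass grows after one step."""
    rng = random.Random(seed)
    correct = set(range(5))  # hypothetical: the first 5 tokens are correct
    wins = 0
    for _ in range(trials):
        logits = [rng.gauss(0.0, 1.0) for _ in range(vocab)]
        before = correct_mass(logits, correct)
        updated = reinforce_step(logits, correct, n_rollouts, lr, rng)
        wins += correct_mass(updated, correct) > before
    return wins / trials


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        print(n, fraction_improved(n))
```

In this toy setting the exact expected update (`expected_step`) raises every correct logit and lowers every incorrect one, so the correct mass cannot shrink. A finite-sample update (`reinforce_step`) is a noisy estimate of that step, and comparing `fraction_improved` across values of `n_rollouts` shows how often the noise reverses the gain.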

Related papers