Fugu-MT paper translation (abstract): The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View

Paper summary: The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View

arxiv url: http://arxiv.org/abs/2510.04028v1
Date: Sun, 05 Oct 2025 04:31:33 GMT
Status: translation complete
Last updated in system: 2025-10-07 16:52:59.408754
Title: The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View
Title (reference translation): The Debate on the RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View
Authors: Xinhao Yao, Lu Yu, Xiaolin Hu, Fengwei Teng, Qing Cui, Jun Zhou, Yong Liu
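Abstract summary: Whether reinforcement learning with verifiable rewards (RLVR) expands or shrinks the reasoning capabilities of large language models (LLMs) is still debated. Some studies contend that RLVR mainly improves sampling efficiency at the expense of diversity and exploratory capacity, which shrinks the capability boundary. Others show that prolonged training can lead to the emergence of novel reasoning strategies, suggesting that the capability boundary expands.
Reference score (attention score computed independently by the site): 37.56564205666228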
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The ongoing debate on whether reinforcement learning with verifiable rewards (RLVR) expands or shrinks the reasoning capabilities of large language models (LLMs) remains unresolved. Some studies contend that RLVR mainly improves sampling efficiency but at the expense of diversity and exploratory capacity, resulting in capability boundary shrinkage. In contrast, others demonstrate that prolonged training can lead to the emergence of novel reasoning strategies, suggesting capability boundary expansion. To reconcile these contradictory findings, we theoretically and empirically show that both perspectives are partially valid, each aligning with a separate phase in an inherent two-stage probability mass dynamic: (1) Exploitation stage: initially, the model primarily samples explored high-reward and low-reward tokens, while rarely selecting the potentially optimal token. Positive advantage estimates increase the probability of high-reward tokens and decrease those of low-reward tokens, yet the optimal token's probability remains largely unchanged during this stage. (2) Exploration stage: as training advances, the growth rate of previously acquired high-reward tokens slows as their probabilities approach saturation. When a potentially optimal token, now receiving positive advantage estimates, is occasionally sampled, its probability increases, while those of the originally high-reward tokens decrease. This dynamic suggests that over-exploitation during the exploitation stage may lead to capability boundary shrinkage, whereas prolonged training into the exploration stage can promote an expansion of the reasoning capability boundary. Building upon our insights, we revisit the potential of only using relative negative gradients for prolonging training, providing a theoretical and empirical foundation for the development of more advanced reasoning capabilities.
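Abstract (reference translation): The debate over whether reinforcement learning with verifiable rewards (RLVR) expands or shrinks the reasoning capabilities of large language models (LLMs) remains unresolved. Some studies argue that RLVR mainly improves sampling efficiency while sacrificing diversity and exploratory capacity, so that the capability boundary shrinks. Others, in contrast, show that prolonged training can lead to the emergence of new reasoning strategies, which suggests that the capability boundary expands. To reconcile these contradictory findings, we show theoretically and empirically that both views are partially valid, each corresponding to a different phase of an inherent two-stage probability mass dynamic: (1) Exploitation stage: initially, the model mainly samples already explored high-reward and low-reward tokens and rarely selects the potentially optimal token. Positive advantage estimates raise the probability of high-reward tokens and lower that of low-reward tokens, but the probability of the optimal token changes little during this stage. (2) Exploration stage: as training progresses, the growth of the previously acquired high-reward tokens slows as their probabilities approach saturation. When a potentially optimal token, which now receives positive advantage estimates, is occasionally sampled, its probability increases, while the probabilities of the originally high-reward tokens decrease. This dynamic suggests that over-exploitation during the exploitation stage may cause the capability boundary to shrink, whereas prolonged training into the exploration stage can promote an expansion of the reasoning capability boundary. Building on these insights, we revisit the potential of using only relative negative gradients to prolong training, providing a theoretical and empirical foundation for developing more advanced reasoning capabilities.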
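
The two-stage probability mass dynamic described in the abstract can be illustrated with a toy model. The sketch below is not the paper's experimental setup: it assumes a single-step, three-token softmax policy (low-reward, high-reward, and a rarely chosen optimal token) and replaces sampling with the exact expected policy gradient, using the mean reward as the baseline. The function name `simulate`, the rewards, the initial logits, the learning rate, and the step count are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulate(rewards=(0.1, 0.6, 1.0), init_logits=(0.0, 0.0, -3.0),
             lr=1.0, steps=20000):
    """Run expected policy-gradient ascent on a 3-token softmax policy.

    Token order (an assumption of this toy): low-reward, high-reward, optimal.
    Returns the list of probability vectors, one per step.
    """
    logits = list(init_logits)
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        history.append(probs)
        # Baseline: expected reward under the current policy.
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of expected reward w.r.t. logit i is
        # p_i * (r_i - baseline), i.e. probability times advantage.
        logits = [z + lr * p * (r - baseline)
                  for z, p, r in zip(logits, probs, rewards)]
    return history


if __name__ == "__main__":
    history = simulate()
    high = [p[1] for p in history]
    peak = high.index(max(high))
    for step in (0, peak, len(history) - 1):
        low_p, high_p, opt_p = history[step]
        print(f"step {step:6d}  low={low_p:.4f}  high={high_p:.4f}  "
              f"optimal={opt_p:.4f}")
```

In this toy, the high-reward token first absorbs most of the probability mass while the optimal token stays rare, which mirrors the exploitation stage. Once the baseline approaches the high-reward value, the high-reward token's advantage turns negative and mass moves to the optimal token, which mirrors the exploration stage. Because the update uses expected gradients, it does not capture the sampling noise that the abstract's argument about occasionally sampled tokens relies on.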

Related papers