Fugu-MT Paper Translation (Abstract): Understanding Diversity Collapse in RLVR via the Lens of Overtraining

Paper overview: Understanding Diversity Collapse in RLVR via the Lens of Overtraining

arxiv url: http://arxiv.org/abs/2606.15455v1
Date: Sat, 13 Jun 2026 20:13:37 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:33.583698
Title: Understanding Diversity Collapse in RLVR via the Lens of Overtraining
Title (reference translation): Understanding Diversity Collapse in RLVR through the Lens of Overtraining
Authors: Suqin Yuan, Jinkun Chen, Jiyang Zheng, Muyang Li, Lei Feng, Dadong Wang, Tao Xiang, Tongliang Liu, Bo An
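Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. The paper formalizes the diversity collapse seen in RLVR through the lens of overtraining, and proposes Bayesian Boundary Gating (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary.
Reference score (attention score computed independently by the site): 78.37408098404312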
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from \emph{diversity collapse}: Pass@$1$ improves while high-$k$ Pass@$k$ degrades, which is viewed as a narrowing of the model's reasoning boundary. We formalize this diversity collapse through the lens of \emph{overtraining}: once a problem's contribution to the reference metric has effectively saturated, further updates no longer expand what the model can solve but still concentrate probability mass on the trajectories favored by on-policy sampling. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-$k$ Pass@$k$, so most updates in standard RLVR are overtraining from the boundary perspective. This perspective also suggests a reading of whether RLVR can expand the model's reasoning abilities beyond the base model: since RLVR is structurally biased against high-$k$ Pass@$k$, its aggregate decline does not by itself mean that no new reasoning gains occurred. Interventionally, restricting updates to problems with zero observed success lifts Pass@$256$ above the base model on difficult benchmarks; observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Building on these findings, we propose \emph{Bayesian Boundary Gating} (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@$k$ across a wide range of $k$.
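Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from diversity collapse: Pass@$1$ improves while high-$k$ Pass@$k$ degrades, which is viewed as a narrowing of the model's reasoning boundary. The paper formalizes this diversity collapse through the lens of overtraining: once a problem's contribution to the reference metric has effectively saturated, further updates no longer expand what the model can solve, but still concentrate probability mass on the trajectories favored by on-policy sampling. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-$k$ Pass@$k$, so most updates in standard RLVR are overtraining from the boundary perspective. This perspective also bears on whether RLVR can expand a model's reasoning abilities beyond the base model: because RLVR is structurally biased against high-$k$ Pass@$k$, an aggregate decline in that metric does not by itself mean that no new reasoning gains occurred. Interventionally, restricting updates to problems with zero observed success lifts Pass@$256$ above the base model on difficult benchmarks; observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Building on these findings, the authors propose Bayesian Boundary Gating (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@$k$ across a wide range of $k$.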
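The saturation argument in the abstract can be illustrated with a small numerical sketch. It assumes the textbook identity Pass@k = 1 - (1 - p)^k for a problem with per-sample success probability p, and uses a plug-in estimate of p from a handful of rollouts. The function names and the choice of 8 rollouts are illustrative assumptions; this is not the BBG estimator from the paper, whose details the abstract does not give.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds,
    given a per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k


def marginal_gain(p: float, k: int) -> float:
    """Derivative of Pass@k with respect to p: how much Pass@k still
    improves if training raises p slightly."""
    return k * (1.0 - p) ** (k - 1)


if __name__ == "__main__":
    # Hypothetical setup: 1 success observed in 8 rollouts (plug-in p = 1/8).
    p_hat = 1 / 8
    for k in (1, 8, 256):
        print(f"k={k:>3}  Pass@k={pass_at_k(p_hat, k):.6f}  "
              f"dPass@k/dp={marginal_gain(p_hat, k):.3e}")
```

Under these assumptions, a single success in eight rollouts already puts Pass@256 at essentially 1, and the derivative with respect to p is vanishingly small, while Pass@1 still has a derivative of 1. Further updates on such a problem therefore raise Pass@1 without moving high-k Pass@k, which is the sense in which the abstract calls them overtraining from the boundary perspective.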


Related papers