Fugu-MT Paper Translation (Summary): How Far Can Unsupervised RLVR Scale LLM Training?

Paper summary: How Far Can Unsupervised RLVR Scale LLM Training?

arxiv url: http://arxiv.org/abs/2603.08660v1
Date: Mon, 09 Mar 2026 17:38:11 GMT
Status: Translation complete
System update date: 2026-03-10 15:13:16.614939
Title: How Far Can Unsupervised RLVR Scale LLM Training?
Title (reference translation): How far can RLVR scale LLM training?
Authors: Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding
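Abstract summary: Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck. Recent works leverage model intrinsic signals and show promising early gains, yet their potential and limitations remain unclear. The authors classify URLVR methods as intrinsic or external based on reward source and establish a unified theoretical framework.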
Reference score (independently computed attention score): 57.44753418846446
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory and extensive experiments. We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned. Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by model prior rather than engineering choices. Despite these scaling limits, we find intrinsic rewards remain valuable in test-time training on small datasets, and propose Model Collapse Step to measure model prior, serving as a practical indicator for RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence they may escape the confidence-correctness ceiling. Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
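Abstract (reference translation): Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals and show promising early gains, yet their potential and limitations remain unclear. In this paper, we revisit URLVR and provide a comprehensive analysis covering taxonomy, theory, and extensive experiments. We first classify URLVR methods as intrinsic or external based on reward source, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. Through systematic experiments, we show that intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by the model prior rather than by engineering choices. Despite these scaling limits, intrinsic rewards remain useful for test-time training on small datasets, and we propose Model Collapse Step to measure the model prior, serving as a practical indicator of RL trainability. Finally, we examine external reward methods that ground verification in computational asymmetries and show preliminary evidence that they may escape the confidence-correctness ceiling. Our findings chart the boundaries of intrinsic URLVR while motivating paths toward scalable alternatives.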


Related papers