Fugu-MT Paper Translation (Abstract): You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Paper overview: You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

arxiv url: http://arxiv.org/abs/2605.21468v1
Date: Wed, 20 May 2026 17:53:20 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.828662
Title: You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
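Authors: Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng
Abstract summary: We show that the weight trajectories of reinforcement learning with verifiable rewards (RLVR) are extremely low-rank and highly predictable. We propose RELEX, a simple and compute-efficient method that estimates the rank-1 subspace from a short observation window. Notably, RELEX can extrapolate far beyond the observation window at no training cost.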
Reference score (independently computed attention score): 23.542887618146988
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
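The procedure described in the abstract (estimate a rank-1 direction from a short prefix of checkpoints, fit the projection magnitude linearly against the training step, then extrapolate) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `relex_extrapolate`, the choice to flatten all weights into a single vector, and the use of `np.linalg.svd` and `np.polyfit` are assumptions made for this example. The released code may, for instance, work per weight matrix.

```python
import numpy as np

def relex_extrapolate(base, checkpoints, steps, target_step):
    """Hypothetical sketch: extrapolate flattened weights to `target_step`.

    base:        (d,)   weights before RLVR training
    checkpoints: (k, d) weights observed during the short window
    steps:       (k,)   training step of each observed checkpoint
    """
    deltas = checkpoints - base                    # parameter deltas, shape (k, d)
    # Rank-1 subspace: top right singular vector of the stacked deltas.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]                              # unit vector, shape (d,)
    coeffs = deltas @ direction                    # projection magnitude per checkpoint
    # Linear regression of the magnitude against the training step.
    slope, intercept = np.polyfit(steps, coeffs, deg=1)
    predicted = slope * target_step + intercept
    # The sign ambiguity of the singular vector cancels in this product.
    return base + predicted * direction

# Synthetic usage (assumed toy data): a noisy, near-linear rank-1 trajectory.
rng = np.random.default_rng(0)
d = 64
base = rng.normal(size=d)
u = rng.normal(size=d)
u /= np.linalg.norm(u)
steps = np.arange(10, 60, 10)                      # observe steps 10..50 only
noise = 1e-3 * rng.normal(size=(len(steps), d))
observed = base + 0.01 * steps[:, None] * u + noise
extrapolated = relex_extrapolate(base, observed, steps, target_step=1000)
```

Because the prediction is confined to the single fitted direction, per-checkpoint noise orthogonal to it is discarded, which is the "denoising" effect the abstract refers to. The toy data here is constructed to be rank-1 and linear, so it demonstrates the mechanics only and says nothing about real RLVR runs.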
