Fugu-MT Paper Translation (Abstract): On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

Paper overview: On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

arxiv url: http://arxiv.org/abs/2605.06523v1
Date: Thu, 07 May 2026 16:30:28 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.996662
Title: On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
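Title (reference translation): On Implicit Reward Overfitting and Low-Rank Dynamics in RLVR
Authors: Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, Tat-Seng Chua
Abstract summary: RLVR may exhibit implicit reward overfitting to the training dataset. The model can achieve satisfactory performance on the test set even when its rewards remain relatively low during training.
Reference score (independently computed attention score): 51.935533482549545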
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent research has shown that the enhanced reasoning capabilities that models acquire through Reinforcement Learning with Verifiable Rewards (RLVR) are concentrated primarily in rank-1 components. Building on this observation, we apply Periodic Rank-1 Substitution and identify a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during training. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 components in RLVR do not retain model knowledge other than mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum: the singular values of almost all linear layers in an RLVR-trained model behave like a heavy-tailed distribution. (3) The left singular vectors associated with the rank-1 components show a stronger alignment tendency during training, which echoes the finding that RLVR is, in essence, optimizing sampling efficiency. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms, or other training paradigms, to implement continual learning.
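Abstract (reference translation): Recent studies have shown that the gains in reasoning ability that models obtain through RLVR (Reinforcement Learning with Verifiable Rewards) are concentrated mainly in rank-1 components. RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during training. (1) The effective rank-1 components in RLVR do not retain model knowledge other than mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum. The singular values of almost all linear layers in an RLVR-trained model behave like a heavy-tailed distribution. (3) The left singular vectors associated with the rank-1 components show a stronger alignment tendency during training, reflecting the finding that RLVR is, in essence, optimizing sampling efficiency. The findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms and other training paradigms to achieve continual learning.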
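
The rank-1 quantities mentioned in the abstract can be illustrated with a short NumPy sketch. This is not the authors' procedure: the abstract does not define Periodic Rank-1 Substitution, so the code assumes the rank-1 component is taken from the weight update (trained weights minus base weights) of a single linear layer, and the function names `rank1_component`, `rank1_energy_fraction`, `substitute_rank1`, and `left_vector_alignment` are hypothetical.

```python
import numpy as np


def rank1_component(delta):
    """Top singular triplet of `delta` and the rank-1 matrix it spans."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    sigma1, u1, v1 = s[0], u[:, 0], vt[0]
    return sigma1, u1, v1, sigma1 * np.outer(u1, v1)


def rank1_energy_fraction(delta):
    """Share of the squared Frobenius norm carried by the top singular value."""
    s = np.linalg.svd(delta, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


def substitute_rank1(w_base, w_trained):
    """Assumed substitution step: keep only the rank-1 part of the update."""
    _, _, _, r1 = rank1_component(w_trained - w_base)
    return w_base + r1


def left_vector_alignment(delta_a, delta_b):
    """Absolute cosine between the top left singular vectors of two updates."""
    _, ua, _, _ = rank1_component(delta_a)
    _, ub, _, _ = rank1_component(delta_b)
    return float(abs(ua @ ub))
```

Under these assumptions, an energy fraction close to 1 would indicate that a layer's update is nearly rank-1, and an alignment close to 1 between two checkpoints would indicate that the dominant left singular direction is stable across training.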

Related papers