Fugu-MT Paper Translation (Abstract): From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

Paper overview: From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

arxiv url: http://arxiv.org/abs/2604.14142v1
Date: Wed, 15 Apr 2026 17:59:01 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.670881
Title: From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
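Title (reference translation): From $P(y|x)$ to $P(y)$: An Investigation of Reinforcement Learning in Pre-train Space
Authors: Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu
Abstract summary: We introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. We therefore propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL.
Reference score (independently computed attention score): 38.33074456644293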
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
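Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) significantly improves LLM reasoning by optimizing the conditional distribution P(y|x), but its potential is fundamentally limited by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. However, conventional pre-training relies on static corpora for passive learning, which leads to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We validate, theoretically and empirically, the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Building on these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments show that DSRL consistently outperforms strong baselines and that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.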
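
The abstract describes NSR-PreRL only at a high level, so the following is a toy sketch of the general idea of a negative-sample-only update on a marginal distribution P(y), not the authors' objective or implementation. It assumes a plain categorical distribution over four hypothetical outcomes in place of an LLM, does not model the prompt x at all, and uses an assumed reward of -1 for incorrect outcomes and no update for correct ones. The names `nsr_step` and `incorrect_mass` and the learning rate are illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nsr_step(logits, incorrect, lr=0.5):
    """One expected policy-gradient step that uses negative samples only.

    Assumption: reward is -1 for outcomes in `incorrect`; correct outcomes
    contribute no gradient at all.
    """
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for y in incorrect:
        for k in range(len(logits)):
            # d log P(y) / d logit_k = 1[k == y] - p[k]
            dlogp = (1.0 if k == y else 0.0) - p[k]
            # Weight by the sampling probability p[y] and the reward -1.
            grad[k] += p[y] * (-1.0) * dlogp
    # Gradient ascent on expected reward.
    return [l + lr * g for l, g in zip(logits, grad)]


def incorrect_mass(logits, incorrect):
    """Total probability currently assigned to incorrect outcomes."""
    p = softmax(logits)
    return sum(p[y] for y in incorrect)


# Toy setup: four outcomes, the first two are treated as incorrect.
logits = [0.0, 0.0, 0.0, 0.0]
incorrect = [0, 1]

before = incorrect_mass(logits, incorrect)
for _ in range(50):
    logits = nsr_step(logits, incorrect)
after = incorrect_mass(logits, incorrect)
final_probs = softmax(logits)
```

In this toy setting the probability assigned to the penalized outcomes shrinks, and the freed mass is shared among the remaining outcomes instead of being pushed toward a single one, which is loosely analogous to the pruning behavior the abstract attributes to NSR-PreRL.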


List of related papers