Fugu-MT Paper Translation (Abstract): Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Paper summary: Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

arxiv url: http://arxiv.org/abs/2605.28184v1
Date: Wed, 27 May 2026 09:07:06 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:55.918519
Title: Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration
Authors: Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin
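Abstract summary: Multi-Token Prediction (MTP) is a widely adopted module in pretraining. The per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a second-order penalty. The paper proposes an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost.
Reference score (independently computed attention score): 61.46060073417047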
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving the reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practices detach MTP gradients because joint training degrades performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a second-order perturbation penalty. This decomposition unifies three MTP training regimes (Detach, Cross-Entropy loss, and Policy loss) and explains why each succeeds or fails. Further analysis of the policy loss reveals that, although it aligns with intuition, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by the analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.
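The abstract describes a per-step effect that splits into a correlation term and a quadratic penalty, with an optimal coefficient balancing the two. The sketch below is not the paper's derivation or the OCC algorithm. It is a toy illustration under stated assumptions: a concave quadratic `J` stands in for the RL objective, `d` is a hypothetical auxiliary direction standing in for an MTP-like gradient, and `lam` is the mixing coefficient. In this toy setting the extra gain from adding `lam * d` to a gradient step is exactly `linear * lam - quad * lam**2`, so the best coefficient has a closed form.

```python
# Toy illustration only. J, A, b, d, eta and lam are assumptions made for
# this sketch; none of them come from the paper.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

A = [[2.0, 0.0], [0.0, 1.0]]  # symmetric positive definite, so J is concave
b = [1.0, 1.0]

def J(theta):
    # Stand-in objective: J(theta) = -0.5 * theta^T A theta + b^T theta
    return -0.5 * dot(theta, matvec(A, theta)) + dot(b, theta)

def grad_J(theta):
    # Gradient of the stand-in objective: b - A theta
    return [bi - ai for bi, ai in zip(b, matvec(A, theta))]

theta = [0.0, 0.0]
eta = 0.1                 # step size
g = grad_J(theta)         # main gradient
d = [1.0, -0.5]           # hypothetical auxiliary direction

def step(lam):
    # One ascent step along the mixed direction g + lam * d
    return [t + eta * (gi + lam * di) for t, gi, di in zip(theta, g, d)]

def gain(lam):
    # Extra change in J from mixing in d, relative to using g alone
    return J(step(lam)) - J(step(0.0))

Ad = matvec(A, d)
# Coefficient of lam: alignment of d with g, corrected by a curvature cross term
linear = eta * dot(g, d) - eta**2 * dot(g, Ad)
# Coefficient of lam**2: curvature penalty along d, always positive here
quad = 0.5 * eta**2 * dot(d, Ad)
# Maximizer of linear * lam - quad * lam**2
lam_star = linear / (2 * quad)

print("linear term:", linear)
print("quadratic penalty:", quad)
print("best coefficient in this toy:", lam_star)
```

In this toy, if the alignment `dot(g, d)` shrinks while `quad` stays fixed, `lam_star` shrinks with it, which mirrors the qualitative picture in the abstract of a decaying correlation term against a persistent quadratic penalty.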

Related papers