Fugu-MT Paper Translation (Abstract): Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Paper abstract: Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

arxiv url: http://arxiv.org/abs/2606.12370v1
Date: Wed, 10 Jun 2026 17:36:45 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:38.598067
Title: Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
Title (reference translation): Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
Authors: Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou
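Abstract summary: Reinforcement learning (RL) has become a key component of modern large language models, yet the rollout stage remains a key bottleneck in RL training pipelines. Multi-Token Prediction (MTP) offers a natural way to accelerate rollouts through speculative decoding, but many studies have observed that MTP acceptance rates degrade significantly during RL training. This paper presents Bebop, a systematic study of MTP in LLM post-training, and offers practical recipes for integrating MTP into large-scale RL pipelines.
Reference score (independently computed attention score): 87.16803442525755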
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
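Abstract (reference translation): Reinforcement learning (RL) has become a key component of modern large language models, yet the rollout stage remains a key bottleneck in RL training pipelines. Multi-Token Prediction (MTP) offers a natural way to speed up rollouts through speculative decoding, but many studies have observed that MTP acceptance rates drop significantly during RL training, which limits the achievable speedup. To address this bottleneck, the paper presents Bebop, a systematic study of MTP in LLM post-training, and offers practical recipes for integrating MTP into large-scale RL pipelines. First, it shows that the MTP acceptance rate is fundamentally bounded by fluctuations in model entropy, with a clear negative linear relationship to the rise of entropy during the RL stage. Second, it shows that probabilistic rejection sampling largely alleviates the disturbance that entropy introduces in RL, compared with greedy draft sampling. It further finds that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in this setting, and proposes a new end-to-end TV loss that directly optimizes the multi-step rejection sampling acceptance rate. This yields roughly 10% higher acceptance rates, reaching up to 95% acceptance and up to 25% additional inference throughput across mathematical reasoning, code generation, and agentic tasks. Third, it evaluates various online MTP training strategies during RL and shows that pre-RL MTP training with the e2e TV loss and rejection sampling keeps the acceptance rate and speedup consistent throughout RL, removing the need for costly online MTP updates. Extensive experiments and analysis are provided to validate these findings. The experimental results show up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.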
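
Illustrative note: the abstract contrasts greedy draft sampling with probabilistic rejection sampling and ties the acceptance rate to a TV (total variation) objective. The following is a minimal sketch of the standard single-token speculative sampling rule, not the paper's implementation. It assumes a target distribution p and a draft distribution q over a toy four-token vocabulary; the function names and the numbers are hypothetical. Under this rule a drafted token is accepted with probability sum(min(p, q)) = 1 - TV(p, q), which suggests why a TV-based loss targets the acceptance rate directly. "Greedy draft" is interpreted here, as an assumption, as the draft always proposing argmax(q), in which case the acceptance probability reduces to p[argmax(q)].

```python
import random


def total_variation(p, q):
    """Total variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def acceptance_rate(p, q):
    """Probability that a token sampled from draft q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def greedy_acceptance_rate(p, q):
    """Acceptance probability when the draft always proposes argmax(q) (one-hot draft)."""
    top = max(range(len(q)), key=lambda i: q[i])
    return p[top]


def speculative_step(p, q, rng):
    """Produce one token distributed as p, using q as the proposal.

    Returns (token, accepted). On rejection the token is resampled from the
    residual distribution proportional to max(p - q, 0).
    """
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


# Hypothetical toy distributions: a fairly high-entropy target and a close draft.
p = [0.4, 0.3, 0.2, 0.1]
q = [0.5, 0.25, 0.15, 0.1]

if __name__ == "__main__":
    print("1 - TV(p, q):      ", round(1.0 - total_variation(p, q), 3))
    print("rejection sampling:", round(acceptance_rate(p, q), 3))
    print("greedy draft:      ", round(greedy_acceptance_rate(p, q), 3))
```

In this toy case the draft tracks the target closely, so rejection sampling accepts most drafted tokens, while the greedy draft is limited by the probability mass the target places on its top token, which shrinks as the target's entropy rises.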
