Fugu-MT Paper Translation (Summary): FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

Paper summary: FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

arxiv url: http://arxiv.org/abs/2509.18362v1
Date: Tue, 16 Sep 2025 07:36:26 GMT
Status: Translation complete
System update date: 2025-09-24 20:41:27.550579
Title: FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
Title (reference translation): FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
Authors: Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen
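Abstract summary: This paper proposes FastMTP, which improves multi-step draft quality by aligning MTP training with its inference pattern. The approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens. Experimental results across seven diverse benchmarks show that FastMTP achieves an average 2.03x speedup over standard next-token prediction.
Reference score (independently computed attention score): 11.691960175716163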
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits practical deployment. While Multi-Token Prediction (MTP) has demonstrated remarkable benefits for model training efficiency and performance, its inherent potential for inference acceleration remains largely unexplored. This paper introduces FastMTP, a simple yet effective method that improves multi-step draft quality by aligning MTP training with its inference pattern, significantly enhancing speculative decoding performance. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens and maintain high acceptance rates across multiple recursive draft steps. By integrating language-aware dynamic vocabulary compression into the MTP head, we further reduce computational overhead in the drafting process. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average 2.03x speedup over standard next-token prediction with lossless output quality, outperforming vanilla MTP by 82%. FastMTP requires only lightweight training and seamlessly integrates with existing inference frameworks, offering a practical and rapidly deployable solution for accelerating LLM inference.
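Abstract (reference translation): As large language models (LLMs) grow more powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits practical deployment. Multi-Token Prediction (MTP) has shown notable benefits for model training efficiency and performance, but its inherent potential for inference acceleration remains largely unexplored. This paper proposes FastMTP, a simple yet effective method that improves multi-step draft quality by aligning MTP training with its inference pattern, substantially improving speculative decoding performance. The method fine-tunes a single MTP head with position-shared weights on self-distilled data, so that it captures dependencies among consecutive future tokens and maintains high acceptance rates across multiple recursive draft steps. Integrating language-aware dynamic vocabulary compression into the MTP head further reduces computational overhead in the drafting process. Experimental results on seven benchmarks show that FastMTP achieves an average 2.03x speedup over standard next-token prediction with lossless output quality, outperforming vanilla MTP by 82%. FastMTP requires only lightweight training and integrates seamlessly with existing inference frameworks.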
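
Illustrative sketch (not from the paper): the abstract describes recursive drafting with a single shared head, followed by verification that keeps the output lossless. The toy Python below shows generic greedy speculative decoding under those assumptions. The functions `target_next`, `draft_next`, `greedy_decode`, and `speculative_decode` are hypothetical stand-ins invented for this example. The draft function is a toy rule rather than a trained MTP head, and neither self-distillation nor vocabulary compression is modeled.

```python
from typing import Callable, List, Tuple

NextFn = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the main model's greedy next token (assumption)."""
    return (ctx[-1] * 3 + 1) % 11


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for a shared draft head: deliberately wrong at some positions."""
    correct = target_next(ctx)
    return correct if len(ctx) % 3 else (correct + 1) % 11


def greedy_decode(next_fn: NextFn, prompt: List[int], n: int) -> List[int]:
    """Plain next-token decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_fn(seq))
    return seq


def speculative_decode(
    target: NextFn, draft: NextFn, prompt: List[int], n: int, k: int = 3
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, number of verify passes)."""
    seq = list(prompt)
    verify_passes = 0
    while len(seq) - len(prompt) < n:
        # Draft k tokens by applying the same draft function recursively
        # to its own previous outputs.
        drafted: List[int] = []
        for _ in range(k):
            drafted.append(draft(seq + drafted))

        # Verification. A real system would score all k positions in one
        # batched forward pass; here that pass is simulated position by
        # position and counted once.
        verify_passes += 1
        accepted: List[int] = []
        for d in drafted:
            t = target(seq + accepted)
            accepted.append(t)  # the target's own token is always kept
            if t != d:
                break  # first mismatch: discard the remaining drafts
        else:
            # Every draft matched, so the same pass also yields a bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[: len(prompt) + n], verify_passes
```

Because every kept token is the target's own greedy choice for the true prefix, the result matches plain greedy decoding, while the number of verification passes can be smaller than the number of generated tokens. How much smaller depends on how often the drafts are accepted, which is the quantity the abstract says FastMTP's training is meant to improve.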
