Fugu-MT Paper Translation (Abstract): Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

Paper overview: Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

arxiv url: http://arxiv.org/abs/2511.20340v1
Date: Tue, 25 Nov 2025 14:20:08 GMT
Status: Translation complete
Last updated in system: 2025-11-26 17:37:04.499622
Title: Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios
Title (reference translation): Scaling Up LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios
Authors: Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao
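Abstract summary: This paper introduces SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer also achieves lower training demands and reduced computational costs.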
Reference score (independently calculated attention level): 76.85739138203014
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative decoding, as they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model's ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
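Abstract (reference translation): Speculative decoding accelerates LLM inference by using computational resources that would otherwise sit idle during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power and use a small autoregressive language model to generate a complex, large draft tree that improves overall prediction accuracy. However, techniques such as batching have been widely adopted in mainstream model inference systems as a superior alternative to speculative decoding, because they compress the available idle computing power. Performing speculative decoding with low verification resources and low scheduling costs has therefore become an important research problem. We believe that what is truly needed are more capable models that allow parallel generation of draft sequences. Recognizing that draft models, by their fundamental nature, only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model's ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.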
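Illustrative sketch (not from the paper): the abstract describes lossless speculative decoding in which a draft model proposes a short block of tokens and the target model verifies them. The Python below is a minimal toy version of that draft-then-verify loop under greedy decoding, using a single draft sequence rather than a draft tree. Every name in it (`speculative_step`, `generate`, `toy_target_next`, `toy_draft_block`) is hypothetical, the toy "models" are simple arithmetic rules, and it does not implement SpecFormer itself.

```python
from typing import Callable, List

Token = int
TargetFn = Callable[[List[Token]], Token]            # greedy next token of the large model
DraftFn = Callable[[List[Token], int], List[Token]]  # k draft tokens proposed in one call


def speculative_step(prefix: List[Token], target_next: TargetFn,
                     draft_block: DraftFn, k: int) -> List[Token]:
    """Propose k draft tokens, keep the longest prefix the target agrees with."""
    draft = draft_block(prefix, k)
    accepted: List[Token] = []
    # A real system would score all k positions in one target forward pass;
    # this sketch simulates that with one target call per position.
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], target_next: TargetFn,
             draft_block: DraftFn, k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, target_next, draft_block, k))
    return out[:len(prompt) + max_new]


def greedy(prompt: List[Token], target_next: TargetFn, max_new: int) -> List[Token]:
    """Plain token-by-token greedy decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def toy_target_next(prefix: List[Token]) -> Token:
    """Toy stand-in for the target model: sum of the last two tokens mod 10."""
    if len(prefix) >= 2:
        return (prefix[-1] + prefix[-2]) % 10
    return (prefix[-1] + 1) % 10


def toy_draft_block(prefix: List[Token], k: int) -> List[Token]:
    """Toy stand-in for a draft model: mostly right, deliberately wrong at some positions."""
    seq, out = list(prefix), []
    for _ in range(k):
        tok = toy_target_next(seq)
        if len(seq) % 3 == 0:
            tok = (tok + 1) % 10  # injected draft error
        out.append(tok)
        seq.append(tok)
    return out


if __name__ == "__main__":
    print(generate([1, 2], toy_target_next, toy_draft_block, k=4, max_new=12))
    print(greedy([1, 2], toy_target_next, max_new=12))
```

Because each emitted token is either a draft token the target agreed with or the target's own choice, the speculative output matches plain greedy decoding; only the number of target steps changes. Each step yields between 1 and k + 1 tokens.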


Related papers