Fugu-MT Paper Translation (Summary): NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies

Paper summary: NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies

arxiv url: http://arxiv.org/abs/2605.26444v2
Date: Mon, 01 Jun 2026 08:39:40 GMT
Status: Translation complete
Last updated in system: 2026-06-15 07:09:36.494484
Title: NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies
Title (reference translation): NanoSpec: Accelerating Speculative Decoding Using Minimalist In-Context Vocabularies
Authors: Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma
Abstract summary: This paper introduces a system design that overcomes the inefficiencies of sparse memory access. As a complementary plug-and-play module, NanoSpec cuts draft time by an average of 51.6% and delivers a $1.17$-$1.29\times$ end-to-end speedup.
Reference score (independently computed attention score): 5.749618977356584
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The massive vocabulary sizes of large language models, often exceeding 100k tokens, impose a computational bottleneck on the final linear projection layer during speculative decoding. Existing vocabulary pruning solutions rely on static or coarsely-grained sub-vocabularies that necessitate large active sizes ($\sim$30k) to maintain draft quality. We propose NanoSpec, a novel training-free approach that breaks this trade-off by dynamically constructing a minimalist, context-aware active vocabulary for each generation step. Leveraging the inherent temporal locality of language generation, NanoSpec achieves high coverage while slashing the average vocabulary size by over $40\times$ (to $<$3k tokens) without requiring any auxiliary trained parameters. To realize the theoretical benefits of such high sparsity on modern hardware, we introduce a system-algorithm co-design that overcomes the inefficiencies of sparse memory access through asynchronous gathering and GPU-resident state management. As a complementary plug-and-play module, NanoSpec cuts draft time by an average of 51.6\%, delivering a $1.17$-$1.29\times$ end-to-end speedup over the state-of-the-art speculative decoding methods EAGLE-2 and EAGLE-3 across 7 tasks and outperforming complex training-based pruning baselines.
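Abstract (reference translation): The large vocabularies of large language models, often exceeding 100k tokens, impose a computational bottleneck on the final linear projection layer during speculative decoding. Existing vocabulary pruning solutions rely on static or coarse-grained sub-vocabularies that require large active sizes ($\sim$30k) to maintain draft quality. We propose NanoSpec, a new training-free approach that breaks this trade-off by dynamically constructing a minimal, context-aware active vocabulary for each generation step. Exploiting the temporal locality of language generation, NanoSpec achieves high coverage while reducing the average vocabulary size by more than $40\times$ (to $<$3k tokens), without requiring any auxiliary trained parameters. To realize the theoretical benefits of such sparsity on modern hardware, we introduce a system-algorithm co-design that overcomes the inefficiencies of sparse memory access through asynchronous gathering and GPU-resident state management. As a complementary plug-and-play module, NanoSpec cuts draft time by an average of 51.6\%, achieving a $1.17$-$1.29\times$ end-to-end speedup over the state-of-the-art speculative decoding methods EAGLE-2 and EAGLE-3 and outperforming complex training-based pruning baselines.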
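
The abstract describes restricting the draft model's final linear projection to a small active vocabulary built from the current context. The following is a minimal pure-Python sketch of that general idea, not the authors' implementation. The function names, the "most recent tokens first" selection rule, and the `always_on_ids` fallback set are all assumptions made for illustration; the paper's actual selection policy, asynchronous gathering, and GPU-resident state management are not reproduced here.

```python
def build_active_vocab(context_ids, always_on_ids, max_size):
    """Hypothetical selection rule: recent context tokens first, then a
    fixed fallback set, deduplicated and capped at max_size entries."""
    active, seen = [], set()
    for tok in list(reversed(context_ids)) + list(always_on_ids):
        if len(active) >= max_size:
            break
        if tok not in seen:
            seen.add(tok)
            active.append(tok)
    return active


def sparse_lm_head(hidden, lm_head_rows, active_ids):
    """Compute logits only for the gathered rows of the output projection.
    Cost scales with len(active_ids) instead of the full vocabulary size."""
    return {
        tok: sum(h * w for h, w in zip(hidden, lm_head_rows[tok]))
        for tok in active_ids
    }


def draft_next_token(hidden, lm_head_rows, active_ids):
    """Greedy draft token chosen among the active vocabulary only."""
    logits = sparse_lm_head(hidden, lm_head_rows, active_ids)
    return max(logits, key=logits.get)


# Toy example: 4-token vocabulary, hidden size 2 (all values are made up).
rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
hidden = [1.0, 2.0]
active = build_active_vocab(context_ids=[1, 2, 1], always_on_ids=[0], max_size=3)
print(active)                                  # prints [1, 2, 0]
print(draft_next_token(hidden, rows, active))  # prints 2
```

In this toy setup, token 3 would win under a full-vocabulary projection but lies outside the active set, so the draft picks token 2 instead. Misses of this kind lower draft acceptance, which is why coverage of the active vocabulary matters as much as its size.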

Related papers