Fugu-MT Paper Translation (Abstract): HiSpec: Hierarchical Speculative Decoding for LLMs

Paper abstract: HiSpec: Hierarchical Speculative Decoding for LLMs

arxiv url: http://arxiv.org/abs/2510.01336v1
Date: Wed, 01 Oct 2025 18:04:14 GMT
Status: translation complete
Last updated in system: 2025-10-03 16:59:20.809373
Title: HiSpec: Hierarchical Speculative Decoding for LLMs
Title (reference translation): HiSpec: Hierarchical Speculative Decoding for LLMs
Authors: Avinash Kumar, Sujay Sanghavi, Poulami Das
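Abstract summary: We propose a speculative decoding framework that exploits $\textit{early-exit (EE)}$ models for low-overhead intermediate verification. HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation.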
Reference score (independently computed attention score): 15.347747465564458
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. $\textit{``Intermediate"}$ verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose $\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$, a framework for high-throughput speculative decoding that exploits $\textit{early-exit (EE) models}$ for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation without compromising accuracy.
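Abstract (reference translation): Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (for example, it is $4\times$ slower than token generation when a 3B model speculates for a 70B target model). Intermediate verification shortens verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overhead to incorporate the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose $\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$, a framework for high-throughput speculative decoding that exploits $\textit{early-exit (EE) models}$ for low-overhead intermediate verification. EE models let tokens exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, which makes them uniquely suited to intermediate verification without drastically increasing compute and memory overheads. To improve resource efficiency further, we design a methodology that lets HiSpec reuse key-value caches and hidden states across the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Evaluations on various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$.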
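
The draft, intermediate-verify, and periodic target-validate flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the three "models" below are hypothetical deterministic functions, verification is assumed to be greedy exact-match, and the names `draft_len` and `target_every` are invented for illustration. Key-value cache and hidden-state reuse are not modeled. The only point it shows is that periodic validation against the target keeps the final output identical to target-only greedy decoding, even when the intermediate verifier occasionally accepts a wrong token.

```python
from typing import Callable, List

VOCAB = 50  # toy vocabulary size (assumption)


def target_next(ctx: List[int]) -> int:
    """Hypothetical target model: greedy next token for a context."""
    return (sum(ctx) * 31 + len(ctx) * 7) % VOCAB


def early_exit_next(ctx: List[int]) -> int:
    """Hypothetical early-exit verifier: disagrees with the target at every 7th position."""
    t = target_next(ctx)
    return (t + 1) % VOCAB if len(ctx) % 7 == 0 else t


def draft_next(ctx: List[int]) -> int:
    """Hypothetical draft model: disagrees with the target at every 3rd position."""
    t = target_next(ctx)
    return (t + 1) % VOCAB if len(ctx) % 3 == 0 else t


def accepted_prefix(verifier: Callable[[List[int]], int],
                    ctx: List[int], proposal: List[int]) -> List[int]:
    """Longest prefix of `proposal` that matches the verifier's greedy choices."""
    accepted: List[int] = []
    for tok in proposal:
        if verifier(ctx + accepted) != tok:
            break
        accepted.append(tok)
    return accepted


def hierarchical_decode(prompt: List[int], n_new: int,
                        draft_len: int = 4, target_every: int = 2) -> List[int]:
    out = list(prompt)
    committed = len(out)  # tokens before this index are confirmed by the target
    rounds = 0
    while committed - len(prompt) < n_new:
        # 1) Draft model speculates a block of tokens.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft_next(out + proposal))
        # 2) Intermediate verifier discards bad draft tokens early and
        #    contributes one token of its own, so every round makes progress.
        kept = accepted_prefix(early_exit_next, out, proposal)
        kept.append(early_exit_next(out + kept))
        out += kept
        rounds += 1
        # 3) Periodically validate everything not yet committed against the target.
        if rounds % target_every == 0 or len(out) - len(prompt) >= n_new:
            pending = out[committed:]
            ok = accepted_prefix(target_next, out[:committed], pending)
            out = out[:committed] + ok
            if len(ok) < len(pending):
                out.append(target_next(out))  # target supplies the correction
            committed = len(out)
    return out[:len(prompt) + n_new]


def target_only_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: greedy decoding with the target model alone."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out
```

For example, `hierarchical_decode([1, 2, 3], 20)` returns the same sequence as `target_only_decode([1, 2, 3], 20)`. In a real system each verification step would be a single batched forward pass rather than a per-token loop; the loop here only keeps the sketch short.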

Related papers