Fugu-MT Paper Translation (Abstract): Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Paper summary: Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

arxiv url: http://arxiv.org/abs/2505.22179v2
Date: Thu, 29 May 2025 04:07:33 GMT
Status: Translation complete
System updated: 2025-05-30 13:10:25.787437
Title: Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Authors: Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
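Abstract summary: Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Quantization achieves this by compressing weights and activations into lower bit-widths, and it also reduces computation via low-bit matrix multiplications. Experiments show that the memory benefits of 4-bit weight quantization are diminished by the computational load of speculative decoding.
Reference score (independently computed attention score): 34.04231165571518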
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78× speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31×. Code available at https://github.com/AI9Stars/SpecMQuant.
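Abstract (reference translation): Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, at the cost of increased computation. Quantization achieves this optimization by compressing weights and activations into lower bit-widths, and it also reduces computation via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits of 4-bit weight quantization are diminished by the computational load of speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight-quantized models. This finding led to a new speculative decoding design: a hierarchical framework that uses a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that the hierarchical approach achieves a 2.78× speedup for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31×. Code is available at https://github.com/AI9Stars/SpecMQuant.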
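
The hierarchical idea in the abstract (a small intermediate model turns a tree-style draft into a sequence draft, which the quantized target model then verifies) can be illustrated with a toy sketch. This is not the SpecMQuant implementation: the names `verify_tree`, `verify_sequence`, `hierarchical_step`, `small`, and `target` are hypothetical, the "models" are plain functions that return a greedy next token, and greedy acceptance is assumed. A real system would score all draft tokens in one batched forward pass rather than calling the model once per token.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a language model: maps a token prefix to its
# greedy next token. Real models return logits over a vocabulary.
NextToken = Callable[[Sequence[int]], int]


def verify_tree(prefix: Sequence[int], tree: Dict, model: NextToken) -> List[int]:
    """Walk a draft tree (nested dict: token -> subtree) and return the single
    path the model agrees with, plus the model's own token where the tree ends."""
    accepted: List[int] = []
    node = tree
    while True:
        expected = model(list(prefix) + accepted)
        accepted.append(expected)
        if expected not in node:  # no matching branch: stop after the model's token
            return accepted
        node = node[expected]


def verify_sequence(prefix: Sequence[int], draft: Sequence[int], model: NextToken) -> List[int]:
    """Accept the longest draft prefix the model agrees with, then append the
    model's correction (on mismatch) or bonus token (if all were accepted)."""
    accepted: List[int] = []
    for tok in draft:
        expected = model(list(prefix) + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(model(list(prefix) + accepted))
    return accepted


def hierarchical_step(prefix: Sequence[int], tree: Dict,
                      small: NextToken, target: NextToken) -> List[int]:
    """Assumed two-stage flow: the small model reduces the tree to a sequence
    draft, and the target model only ever verifies that sequence."""
    sequence_draft = verify_tree(prefix, tree, small)
    return verify_sequence(prefix, sequence_draft, target)


# Toy models (illustrative only): the target counts upward modulo 10;
# the small model agrees except that it predicts 5 after a 3.
def target(prefix: Sequence[int]) -> int:
    return (prefix[-1] + 1) % 10


def small(prefix: Sequence[int]) -> int:
    return 5 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


draft_tree = {1: {2: {3: {5: {}}, 9: {}}}, 7: {}}
print(hierarchical_step([0], draft_tree, small, target))
```

Because every emitted token is either confirmed or produced by `target`, the output of this sketch matches what greedy decoding with `target` alone would generate; the draft stages only change how many tokens are produced per target verification.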

Related papers