Fugu-MT Paper Translation (Abstract): CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs

Paper summary: CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs

arxiv url: http://arxiv.org/abs/2510.26843v1
Date: Thu, 30 Oct 2025 08:51:29 GMT
Status: Translation complete
System update date: 2025-11-03 17:52:15.864838
Title: CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
Title (reference translation): CAS-Spec: A Cascade Adaptive Self-Speculative Decoding Method for On-the-Fly Lossless Inference Acceleration of LLMs
Authors: Zhiyuan Ning, Jiawei Shao, Ruge Xu, Xinfei Guo, Jun Zhang, Chi Zhang, Xuelong Li
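Abstract summary: On-the-fly self-speculative decoding offers seamless integration and broad utility when deploying large language models. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. This paper proposes the Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method, which constructs speculative draft models.
Reference score (independently computed attention score): 48.8252978488871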
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Speculative decoding has become widely adopted as an effective technique for lossless inference acceleration when deploying large language models (LLMs). While on-the-fly self-speculative methods offer seamless integration and broad utility, they often fall short of the speed gains achieved by methods relying on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. In this paper, we propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method, which constructs speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies, including layer sparsity and activation quantization. Furthermore, traditional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative decoding methods. We introduce a Dynamic Tree Cascade (DyTC) algorithm that adaptively routes the multi-level draft models and assigns the draft lengths, based on the heuristics of acceptance rates and latency prediction. Our CAS-Spec method achieves state-of-the-art acceleration compared to existing on-the-fly speculative decoding methods, with an average speedup from $1.1\times$ to $2.3\times$ over autoregressive decoding across various LLMs and datasets. DyTC improves the average speedup by $47$\% and $48$\% over cascade-based and tree-based baseline algorithms, respectively. CAS-Spec can be easily integrated into most existing LLMs and holds promising potential for further acceleration as self-speculative decoding techniques continue to evolve.
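Abstract (reference translation): Speculative decoding has been widely adopted as an effective technique for lossless inference acceleration when deploying large language models (LLMs). On-the-fly self-speculative methods offer seamless integration and broad utility, but they often fall short of the speed gains achieved by methods that depend on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. This paper proposes the Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method, which builds speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies. In addition, conventional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative decoding. The paper therefore proposes the Dynamic Tree Cascade (DyTC) algorithm, which adaptively routes the multi-level draft models and assigns draft lengths based on heuristics of acceptance rates and latency prediction. CAS-Spec achieves state-of-the-art acceleration compared with existing on-the-fly speculative decoding methods, with an average speedup from $1.1\times$ to $2.3\times$ over autoregressive decoding across various LLMs and datasets. DyTC improves the average speedup by $47$\% and $48$\% over cascade-based and tree-based baseline algorithms, respectively. CAS-Spec can be easily integrated into most existing LLMs and holds potential for further acceleration as self-speculative decoding techniques continue to evolve.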
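
To make the phrase "heuristics of acceptance rates and latency prediction" concrete, here is a minimal Python sketch of one way such a choice could be scored. It is not the DyTC algorithm from the paper, which the abstract does not specify. It uses the standard speculative-decoding estimate that a draft of length k with per-token acceptance rate alpha (assumed independent across tokens) yields (1 - alpha^(k+1)) / (1 - alpha) tokens per verification step, and it assumes a step costs k draft passes plus one target pass. The names `DraftConfig`, `expected_speedup`, and `choose_draft`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftConfig:
    """Hypothetical draft configuration; names and fields are illustrative."""
    name: str
    acceptance_rate: float  # estimated per-token acceptance probability
    relative_cost: float    # predicted draft-pass latency / target-pass latency


def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per verification step for draft length k,
    assuming each drafted token is accepted independently with prob. alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(cfg: DraftConfig, k: int) -> float:
    """Expected speedup over autoregressive decoding: tokens per step
    divided by step cost (k draft passes plus one target pass)."""
    return expected_tokens(cfg.acceptance_rate, k) / (k * cfg.relative_cost + 1.0)


def choose_draft(configs, max_len=8):
    """Return (config, draft_length, speedup) with the highest expected speedup.
    (None, 0, 1.0) means no draft beats plain autoregressive decoding."""
    best = (None, 0, 1.0)
    for cfg in configs:
        for k in range(1, max_len + 1):
            s = expected_speedup(cfg, k)
            if s > best[2]:
                best = (cfg, k, s)
    return best


if __name__ == "__main__":
    # Made-up estimates for two DSIA-style drafts (assumed values).
    candidates = [
        DraftConfig("layer_sparsity", acceptance_rate=0.8, relative_cost=0.5),
        DraftConfig("activation_quant", acceptance_rate=0.9, relative_cost=0.7),
    ]
    cfg, k, speedup = choose_draft(candidates)
    print(cfg.name if cfg else "autoregressive", k, round(speedup, 3))
```

In this simplified model a cheaper draft with a lower acceptance rate can still win over a more accurate but slower one, which is the kind of trade-off an adaptive router has to weigh. A real system would update the acceptance-rate and latency estimates online instead of fixing them up front.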
