Fugu-MT Paper Translation (Abstract): Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

Paper overview: Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

arxiv url: http://arxiv.org/abs/2512.01278v1
Date: Mon, 01 Dec 2025 04:50:55 GMT
Status: Translation complete
Last updated in system: 2025-12-02 19:46:34.68783
Title: Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
Title (reference translation): Accelerating Large-Scale Reasoning Models with Sparse Self-Speculative Decoding
Authors: Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-Chih Chang, Chaofan Lin, Jongseok Park, Guangxuan Xiao, Mohamed S. Abdelfattah, Mingyu Gao, Baris Kasikci, Song Han, Ion Stoica
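Abstract summary: SparseSpec is a speculative decoding framework that reuses the same model as both the draft model and the target model. SparseSpec uses PillarAttn, a novel sparse attention mechanism, as the draft model, and accurately selects critical tokens by reusing information from the verification stage. Across various models and datasets, SparseSpec outperforms state-of-the-art solutions, with a throughput speedup of up to 2.13x.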
Reference score (independently computed attention score): 39.863506456723655
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth. To address this, we introduce SparseSpec, a speculative decoding framework that reuses the same model as the draft and target models (i.e., self-speculation). SparseSpec features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens via elegantly reusing information from the verification stage. Furthermore, SparseSpec co-designs self-speculation with three system innovations: (1) a unified scheduler to batch token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SparseSpec outperforms state-of-the-art solutions, with an up to 2.13x throughput speedup.
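Abstract (reference translation): Reasoning language models have shown remarkable ability on difficult tasks by generating elaborate chain-of-thought (CoT) solutions. However, such long generations shift the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, which requires memory access to an ever-larger KV-Cache. As a result, longer generations require more memory access at every step, putting substantial pressure on memory bandwidth. To address this, we introduce SparseSpec, a speculative decoding framework that reuses the same model as both the draft model and the target model (i.e., self-speculation). SparseSpec uses PillarAttn, a novel sparse attention mechanism, as the draft model, and accurately selects critical tokens by reusing information from the verification stage. In addition, SparseSpec co-designs self-speculation with three system innovations: (1) a unified scheduler that batches token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SparseSpec outperforms state-of-the-art solutions, with a throughput speedup of up to 2.13x.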
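
Illustrative sketch (not from the paper): the abstract describes self-speculation, where one model drafts tokens cheaply with sparse attention and then verifies them with full attention. The toy Python below shows only that general draft-then-verify loop under greedy decoding. Everything in it is an assumption made for illustration: the `attend` scoring function, the top-`budget` selection in `select_critical`, and the parameter names are hypothetical and are not PillarAttn or the SparseSpec system. Verification here runs one position at a time for clarity, whereas a real system would verify all drafted positions in one batched pass.

```python
import math

VOCAB = 16  # toy vocabulary size (assumption)


def attend(context, keep=None):
    """Toy attention step over a non-empty context.

    Attends to all positions, or only to those in `keep` (sparse).
    Returns (greedy next token, attention weight per attended position).
    """
    positions = range(len(context)) if keep is None else sorted(keep)
    query = context[-1]
    scores = {i: ((context[i] * 31 + query * 17 + i * 7) % 13) / 3.0
              for i in positions}
    z = sum(math.exp(s) for s in scores.values())
    weights = {i: math.exp(s) / z for i, s in scores.items()}
    logits = [0.0] * VOCAB
    for i, w in weights.items():
        logits[(context[i] * 5 + i) % VOCAB] += w
    return max(range(VOCAB), key=lambda v: logits[v]), weights


def select_critical(weights, budget):
    """Keep the `budget` positions with the largest full-attention weight."""
    return set(sorted(weights, key=weights.get, reverse=True)[:budget])


def greedy_full(prompt, n_new):
    """Baseline: plain greedy decoding with full attention at every step."""
    ctx = list(prompt)
    for _ in range(n_new):
        tok, _ = attend(ctx)
        ctx.append(tok)
    return ctx[len(prompt):]


def self_speculative(prompt, n_new, draft_len=4, budget=4):
    """Draft with sparse attention, verify with full attention (same model)."""
    ctx = list(prompt)
    target = len(prompt) + n_new
    critical = set(range(len(ctx)))  # no verification info yet: keep all
    accepted = 0
    while len(ctx) < target:
        base = len(ctx)
        # Draft phase: attend only to critical tokens plus the newest ones.
        draft = list(ctx)
        for _ in range(draft_len):
            keep = critical | set(range(base - 1, len(draft)))
            tok, _ = attend(draft, keep)
            draft.append(tok)
        # Verify phase: full attention decides every emitted token.
        for j in range(draft_len):
            tok, weights = attend(draft[:base + j])
            ctx.append(tok)
            matched = tok == draft[base + j]
            accepted += int(matched)
            if not matched or len(ctx) == target:
                break
        # Reuse the verification pass's weights to pick the next critical set.
        critical = select_critical(weights, budget)
    return ctx[len(prompt):], accepted


prompt = [3, 1, 4, 1, 5, 9, 2, 6]
tokens, accepted = self_speculative(prompt, 20)
print(tokens, "accepted drafts:", accepted)
```

Because every emitted token comes from the full-attention pass, the output matches `greedy_full` for the same prompt; the sparse draft only affects how many tokens are accepted per verification round, not which tokens are produced.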

Related papers