Fugu-MT Paper Translation (Abstract): AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding

Paper abstract: AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding

arxiv url: http://arxiv.org/abs/2510.07486v1
Date: Wed, 08 Oct 2025 19:36:11 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:14.694566
Title: AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
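Title (reference translation): AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
Authors: Shuqing Luo, Yilin Guan, Pingzhi Li, Hanrui Wang, Tianlong Chen
Abstract summary: Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT). KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components.
Reference score (independently computed attention metric): 35.10915929939651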
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but the linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware page-level sparse decoding can achieve state-of-the-art performance under constrained FLOPs budgets, but is limited by both sequentially dependent page filtering and coarse-grained token selection, hampering serving efficiency and model performance on TTS tasks under high-concurrency and long-CoT scenarios (the filtering can consume even more runtime than the forward pipeline itself). In this paper, we first find that the current-step query state can be accurately approximated in a unified manner from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components: (1) a novel lightweight temporal-regressive module that predicts the next-token query state; (2) an asynchronous and disaggregated framework that decouples the KV-cache filtering from the auto-regressive decoding loop, overlapping the token-level KV selection with the forward inference computation through asynchronism. To our knowledge, AsyncSpade is the first to eliminate the sequential dependence without sacrificing model performance. We validate the effectiveness of AsyncSpade on common LLM serving setups with an A100 node, where AsyncSpade fully overlaps KV-cache operations with the inference pipeline, achieving the theoretically optimal time-per-output-token (TPOT). Specifically, AsyncSpade delivers over 20% reduction in TPOT compared to the SoTA baseline (i.e., Quest) and at least 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).
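Abstract (reference translation): Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware page-level sparse decoding can achieve state-of-the-art performance under constrained FLOPs budgets, but it is limited by both sequentially dependent page filtering and coarse-grained token selection. In this paper, we first find that the current-step query state can be approximated in a unified manner from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components: (1) a novel lightweight temporal-regressive module that predicts the next-token query state; (2) an asynchronous and disaggregated framework that decouples KV-cache filtering from the auto-regressive decoding loop. To our knowledge, AsyncSpade is the first method to eliminate the sequential dependence without sacrificing model performance. On common LLM serving setups with an A100 node, AsyncSpade fully overlaps KV-cache operations with the inference pipeline, achieving the theoretically optimal time-per-output-token (TPOT). Specifically, AsyncSpade reduces TPOT by over 20% compared to the SoTA baseline (Quest) and by at least 50% compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).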
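
The idea of predicting the current query from a short window of recent queries, then using that prediction for token-level KV selection, can be illustrated with a toy sketch. This is not the paper's temporal-regressive module: the exponentially weighted mean below is a stand-in chosen for simplicity, and the names `predict_next_query`, `select_top_k_tokens`, and the `decay` parameter are hypothetical. In an asynchronous setup the selection step would run concurrently with the forward pass; here both steps run sequentially for clarity.

```python
from typing import List, Sequence

Vector = Sequence[float]


def predict_next_query(recent_queries: Sequence[Vector], decay: float = 0.5) -> List[float]:
    """Approximate the next query as an exponentially weighted mean of recent queries.

    Assumption: this simple weighting stands in for a learned or fitted
    regression over the window; it is not the method from the paper.
    """
    if not recent_queries:
        raise ValueError("need at least one recent query")
    dim = len(recent_queries[0])
    # The most recent query gets weight 1, the one before it `decay`, and so on.
    weights = [decay ** age for age in range(len(recent_queries) - 1, -1, -1)]
    total = sum(weights)
    return [
        sum(w * q[d] for w, q in zip(weights, recent_queries)) / total
        for d in range(dim)
    ]


def select_top_k_tokens(query: Vector, keys: Sequence[Vector], k: int) -> List[int]:
    """Return the indices of the k cached keys with the largest dot-product score.

    Indices are returned in position order so the selected KV entries keep
    their original sequence order.
    """
    scores = [sum(qi * ki for qi, ki in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy data: three recent 2-d queries and four cached keys.
recent = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]

predicted = predict_next_query(recent)
selected = select_top_k_tokens(predicted, keys, k=2)
print(selected)
```

Because the predicted query depends only on past queries, the selection for the next step can start before that step's true query exists, which is what removes the sequential dependence described in the abstract.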

Related papers