Fugu-MT Paper Translation (Abstract): KVBuffer: IO-aware Serving for Linear Attention

Paper summary: KVBuffer: IO-aware Serving for Linear Attention

arxiv url: http://arxiv.org/abs/2605.19049v1
Date: Mon, 18 May 2026 19:14:52 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:08.957672
Title: KVBuffer: IO-aware Serving for Linear Attention
Title (reference translation): KVBuffer: IO-aware Serving for Linear Attention
Authors: Longwei Zou, Lin Zhong
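Abstract summary: We propose KVBuffer, an IO-aware serving mechanism for linear attention. KVBuffer enables serving systems to compute linear attention outputs in more flexible and memory-efficient ways. Evaluations show that KVBuffer can reduce linear attention decoding latency by up to 45.17% and, for speculative decoding, increase the maximum number of serving requests by 5x.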
Reference score (independently computed attention score): 3.3481337735067673
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by recurrently computing and updating a large linear attention state in every decoding step. Since the state is much larger than the per-token key and value, recurrent decoding incurs substantial memory access and becomes inefficient for serving linear attention. In this paper, we propose KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, KVBuffer enables serving systems to compute linear attention outputs in more flexible and memory-efficient ways. For decoding, KVBuffer enables chunkwise computation, which reduces average memory access and decoding latency by deferring state updates and applying them in batch. For speculative decoding, KVBuffer verifies draft tokens in parallel and avoids storing temporary states. For short contexts, KVBuffer computes attention outputs directly from buffered keys and values, without creating or updating the linear attention state. We implement KVBuffer in SGLang for Qwen3-Next. Our evaluations show that KVBuffer can reduce linear attention decoding latency by up to 45.17% and increase the maximum number of serving requests by 5x for speculative decoding when verifying four draft tokens.
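Abstract (reference translation): Linear attention has recently attracted interest for long-context inference because its decoding cost is constant with respect to context length. However, existing serving systems typically serve linear attention by recurrently computing and updating a large linear attention state at every decoding step. Because the state is much larger than the per-token key and value, recurrent decoding incurs substantial memory access and is inefficient for serving linear attention. This paper proposes KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, KVBuffer lets serving systems compute linear attention outputs in more flexible and memory-efficient ways. For decoding, KVBuffer enables chunkwise computation, which reduces average memory access and decoding latency by deferring state updates and applying them in batch. For speculative decoding, KVBuffer verifies draft tokens in parallel and avoids storing temporary states. For short contexts, KVBuffer computes attention outputs directly from the buffered keys and values, without creating or updating the linear attention state. KVBuffer is implemented in SGLang for Qwen3-Next. The evaluations show that KVBuffer can reduce linear attention decoding latency by up to 45.17% and, for speculative decoding that verifies four draft tokens, increase the maximum number of serving requests by 5x.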
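The deferred-update idea in the abstract can be illustrated with a small sketch. It assumes the simplest form of linear attention (state S = sum of k v^T, output o = q S, with no gating, decay, or normalization), which is a simplification and not necessarily the variant served in the paper. The function names and the `chunk` parameter are hypothetical and are not taken from KVBuffer or SGLang.

```python
def recurrent_decode(qs, ks, vs):
    """Baseline: read and write the full d_k x d_v state at every step."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_k):          # S += k v^T touches the whole state
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


def buffered_decode(qs, ks, vs, chunk=4):
    """Sketch: buffer recent (k, v) pairs and fold them into the state
    once per `chunk` steps (hypothetical parameter)."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    buf_k, buf_v = [], []
    outputs, state_updates = [], 0
    for q, k, v in zip(qs, ks, vs):
        buf_k.append(k)
        buf_v.append(v)
        # Part 1: contribution of the not-yet-updated state, q S.
        out = [sum(q[i] * state[i][j] for i in range(d_k))
               for j in range(d_v)]
        # Part 2: contribution of buffered tokens, sum_i (q . k_i) v_i.
        for bk, bv in zip(buf_k, buf_v):
            score = sum(a * b for a, b in zip(q, bk))
            for j in range(d_v):
                out[j] += score * bv[j]
        outputs.append(out)
        if len(buf_k) == chunk:       # deferred update, applied in batch
            for bk, bv in zip(buf_k, buf_v):
                for i in range(d_k):
                    for j in range(d_v):
                        state[i][j] += bk[i] * bv[j]
            buf_k.clear()
            buf_v.clear()
            state_updates += 1
    return outputs, state_updates
```

Under these assumptions both functions return the same outputs, but the buffered version writes the state only once every `chunk` steps. The per-step work on buffered tokens involves vectors of size d_k and d_v, whereas each state update touches d_k x d_v entries.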

