Fugu-MT Paper Translation (Abstract): LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences

Paper abstract: LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences

arxiv url: http://arxiv.org/abs/2510.11292v1
Date: Mon, 13 Oct 2025 11:28:30 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.34087
Title: LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
Title (reference translation): LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
Authors: Wenbo Wu, Qingyi Si, Xiurui Pan, Ye Wang, Jie Zhang
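Abstract summary: The Key-Value (KV) cache succeeds in reducing redundant computation in auto-regressive models, but it adds significant memory overhead, which limits practical deployment in long-sequence scenarios. Existing KV retrieval methods suffer from notable efficiency and accuracy bottlenecks caused by per-token retrieval and coarse-grained page-level KV management.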
Reference score (independently computed attention level): 12.093166735658626
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Key-Value (KV) cache succeeds in reducing redundant computations in auto-regressive models, it introduces significant memory overhead, limiting its practical deployment in long-sequence scenarios. Existing KV retrieval methods mitigate this by dynamically retaining only a subset of KV entries on the GPU. However, they still suffer from notable efficiency and accuracy bottlenecks due to per-token retrieval and coarse-grained page-level KV management, especially in long-output reasoning scenarios. With the emergence of large reasoning models, efficiently handling such scenarios has become increasingly important. To address this issue, we present two key observations: (1) critical KVs exhibit strong temporal locality during decoding, and (2) these KVs exhibit distinct distribution patterns across the input prompt and generated output. Building on these observations, we propose LouisKV, an efficient KV cache retrieval framework designed for various long-sequence scenarios. Specifically, LouisKV introduces a semantic-aware retrieval strategy leveraging temporal locality to trigger retrieval only at semantic boundaries, drastically reducing computation and data transfer overhead. LouisKV also designs a decoupled, fine-grained management scheme that tailors differentiated strategies for input and output sequences to create retrieval units that better match the model's attention patterns, enabling precise identification of critical KVs. Furthermore, to boost efficiency, LouisKV incorporates several kernel-level optimizations, including custom Triton and CUDA kernels to accelerate the KV clustering and retrieval. Evaluations show that LouisKV achieves up to 4.7$\times$ speedup over state-of-the-art KV retrieval methods while maintaining near-lossless accuracy across diverse long-sequence tasks, including long-input short-output, short-input long-output, and long-input long-output scenarios.
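Abstract (reference translation): The Key-Value (KV) cache succeeds in reducing redundant computation in auto-regressive models, but it adds significant memory overhead, which limits practical deployment in long-sequence scenarios. Existing KV retrieval methods mitigate this by dynamically keeping only a subset of KV entries on the GPU. However, they suffer from notable efficiency and accuracy bottlenecks caused by per-token retrieval and coarse-grained page-level KV management, especially in long-output reasoning scenarios. With the emergence of large reasoning models, handling such scenarios efficiently has become increasingly important. To address this issue, the authors present two observations: (1) critical KVs show strong temporal locality during decoding, and (2) these KVs show distinct distribution patterns across the input prompt and the generated output. Building on these observations, they propose LouisKV, an efficient KV cache retrieval framework for a range of long-sequence scenarios. Specifically, LouisKV introduces a semantic-aware retrieval strategy that exploits temporal locality to trigger retrieval only at semantic boundaries, greatly reducing computation and data-transfer overhead. LouisKV also designs a decoupled, fine-grained management scheme that applies separate strategies to input and output sequences, creating retrieval units that better match the model's attention patterns and enabling precise identification of critical KVs. To further improve efficiency, LouisKV incorporates several kernel-level optimizations, including custom Triton and CUDA kernels that accelerate KV clustering and retrieval. Evaluations show that LouisKV achieves up to 4.7$\times$ speedup over state-of-the-art KV retrieval methods while maintaining near-lossless accuracy across diverse long-sequence tasks, including long-input short-output, short-input long-output, and long-input long-output scenarios.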
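The abstract's idea of triggering retrieval only at semantic boundaries can be illustrated with a small sketch. Everything concrete here is an assumption for illustration and is not taken from the paper: a boundary is detected when the cosine similarity between the current query vector and the query used at the last retrieval falls below a hypothetical `threshold`, critical KVs are chosen by a plain dot-product top-k, and the function names are invented. It is not LouisKV's implementation, only a toy showing why temporal locality lets most decoding steps reuse the previously retrieved set.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_indices(query, keys, k):
    """Indices of the k keys with the largest dot product against the query."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def decode_with_boundary_retrieval(queries, keys, k=2, threshold=0.8):
    """Retrieve critical KV indices only when the query drifts from the anchor.

    Assumption: a "semantic boundary" is a step whose query has cosine
    similarity below `threshold` with the query of the last retrieval.
    Returns the steps at which retrieval ran and the index set used per step.
    """
    anchor = None          # query vector at the most recent retrieval
    selected = []          # KV indices currently kept for attention
    retrieval_steps = []
    per_step = []
    for step, q in enumerate(queries):
        if anchor is None or cosine(anchor, q) < threshold:
            selected = top_k_indices(q, keys, k)   # expensive path
            retrieval_steps.append(step)
            anchor = q
        per_step.append(list(selected))            # cheap path reuses the set
    return retrieval_steps, per_step


if __name__ == "__main__":
    keys = [[1, 0], [0, 1], [1, 1], [-1, 0]]
    queries = [[1, 0], [0.9, 0.1], [0.95, 0.05], [0, 1], [0.1, 0.9]]
    steps, used = decode_with_boundary_retrieval(queries, keys)
    print("retrieval ran at steps:", steps)
    print("KV indices used per step:", used)
```

In this toy run, five decoding steps need only two retrievals, whereas a per-token scheme would retrieve at every step. The saving in a real system depends on how the boundary test is defined and how often generation actually shifts topic.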

Related papers