Fugu-MT Paper Translation (Abstract): Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

Paper overview: Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

arxiv url: http://arxiv.org/abs/2603.11564v1
Date: Thu, 12 Mar 2026 05:36:32 GMT
Status: translation complete
System update date: 2026-03-13 14:46:25.913655
Title: Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries
Authors: Zhenxu Tian, Yi Su, Juntao Li, Min Zhang
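Abstract summary: The key-value (KV) cache is essential for efficient large language model (LLM) inference. Existing KV cache compression methods rely on input-side attention patterns to estimate token importance during the prefill stage. The paper proposes DapQ, which uses position-aware pseudo queries to approximate decoding-stage queries for KV cache compression.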
Reference score (independently computed attention metric): 39.38028687042293
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Key-Value (KV) cache is crucial for efficient Large Language Model (LLM) inference, but excessively long contexts drastically increase its memory footprint. Existing KV cache compression methods typically rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. They fail to preserve tokens that are critical for future generation because these assessments are not derived from the decoding process. Intuitively, an effective observation window should mirror the decoding-stage queries so that it accurately reflects which tokens the generation process will attend to. However, ground-truth decoding queries are inherently unavailable during inference. When constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Motivated by this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that leverages position-aware pseudo queries to simulate the output tokens, thereby establishing an effective observation window for importance assessment. It aligns closely with the actual generation context and enables precise token eviction. Extensive evaluations across multiple benchmarks and LLMs demonstrate that DapQ achieves superior performance, particularly under strict memory constraints (e.g., up to 99.5%, nearly lossless, on NIAH with a 3% KV cache budget).
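The observation-window idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how DapQ constructs its pseudo queries, so the sketch assumes a single attention head with rotary position embeddings (RoPE) and simply places arbitrary content vectors at the positions the output tokens would occupy. The function names `rope` and `select_kv`, the sizes, and the mean-attention scoring rule are all hypothetical choices made for illustration.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Apply a rotary position embedding to x of shape (n, d), with d even.

    Assumption: the "rotate halves" RoPE layout; real models may differ.
    """
    half = x.shape[1] // 2
    inv_freq = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * inv_freq[None, :]    # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


def select_kv(keys, queries, budget):
    """Return sorted indices of the `budget` keys with the highest mean
    softmax attention over the observation-window queries."""
    d = keys.shape[1]
    logits = queries @ keys.T / np.sqrt(d)             # (m, n)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    scores = attn.mean(axis=0)                         # one score per key
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)


rng = np.random.default_rng(0)
n, d, window, budget = 64, 16, 8, 8                    # illustrative sizes only

# Prompt keys at positions 0 .. n-1.
raw_keys = rng.standard_normal((n, d))
keys = rope(raw_keys, np.arange(n))

# Hypothetical pseudo queries: arbitrary content, but rotated to the
# future positions n .. n+window-1 where decoding would take place.
pseudo_content = rng.standard_normal((window, d))
pseudo_queries = rope(pseudo_content, np.arange(n, n + window))

kept = select_kv(keys, pseudo_queries, budget)
print(kept)
```

In this sketch the only thing the pseudo queries share with real decoding queries is their position, which is the ingredient the abstract reports as more important than semantic content. A prompt-side observation window would instead pass the last few prompt queries to `select_kv`.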

Related papers