Fugu-MT Paper Translation (Abstract): Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Paper overview: Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

arxiv url: http://arxiv.org/abs/2605.09649v1
Date: Sun, 10 May 2026 16:47:50 GMT
Status: Translation complete
System update date: 2026-05-12 23:28:50.349578
Title: Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
Authors: Ngoc Bui, Hieu Trung Nguyen, Arman Cohan, Rex Ying
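Abstract summary: We propose a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. The method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
Reference score (attention score computed independently by this site): 65.710271475739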
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
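
The single global eviction policy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the gate architecture or the exact form of geometric retention, so this sketch assumes a score of the form beta ** age (a gate value beta in [0, 1] decayed by token age), and the names `geometric_retention`, `select_global`, and the `(layer, head, position)` key format are hypothetical. The gate values are made up for illustration.

```python
import heapq
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical key format for one cached KV entry: (layer, head, token position).
EntryId = Tuple[int, int, int]


def geometric_retention(beta: float, age: int) -> float:
    """Assumed decay form (not confirmed by the abstract): beta ** age."""
    return beta ** age


def select_global(scores: Dict[EntryId, float], budget: int) -> List[EntryId]:
    """Keep the `budget` highest-scoring entries across all layers and heads."""
    if budget <= 0:
        return []
    top = heapq.nlargest(budget, scores.items(), key=lambda item: item[1])
    return sorted(entry for entry, _ in top)


# Toy gate values, invented for this example (not results from the paper).
betas = {
    (0, 0, 0): 0.99, (0, 0, 1): 0.50, (0, 0, 2): 0.95,
    (1, 0, 0): 0.60, (1, 0, 1): 0.40, (1, 0, 2): 0.90,
}
now = 3  # current decoding step, so age = now - position

# Score every cached entry, then let all entries compete for one shared budget.
scores = {entry: geometric_retention(beta, now - entry[2])
          for entry, beta in betas.items()}
kept = select_global(scores, budget=3)

# Count how many slots each (layer, head) pair ended up with.
per_head = Counter((layer, head) for layer, head, _ in kept)

print(kept)      # [(0, 0, 0), (0, 0, 2), (1, 0, 2)]
print(per_head)  # layer 0 keeps two entries, layer 1 keeps one
```

Because the entries share one budget, the split across heads is uneven here (two slots for layer 0, one for layer 1), which is the behavior a fixed per-head budget could not produce. Comparing raw scores across layers only makes sense if they are on a common scale, which is the role the abstract assigns to the shared final scoring projection.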
