Fugu-MT Paper Translation (Abstract): Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution

Paper overview: Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution

arxiv url: http://arxiv.org/abs/2510.00636v1
Date: Wed, 01 Oct 2025 08:12:14 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.461436
Title: Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Title (reference translation): Expected Attention: KV Cache Compression via Attention Estimation from the Future Query Distribution
Authors: Alessio Devoto, Maximilian Jeblick, Simon Jégou
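Abstract summary: We introduce Expected Attention, a training-free compression method that estimates the importance of KV pairs by predicting how future queries will attend to them. The method operates seamlessly across both the prefilling and decoding phases and consistently outperforms state-of-the-art baselines in both scenarios. We release KVPress, a comprehensive library for implementing and benchmarking KV cache compression methods.
Reference score (independently computed interest level): 2.894551569099569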
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Memory consumption of the Key-Value (KV) cache represents a major bottleneck for efficient large language model inference. While attention-score-based KV cache pruning shows promise, it faces critical practical limitations: attention scores from future tokens are unavailable during compression, and modern implementations like Flash Attention do not materialize the full attention matrix, making past scores inaccessible. To overcome these challenges, we introduce Expected Attention, a training-free compression method that estimates the importance of KV pairs by predicting how future queries will attend to them. Our approach leverages the distributional properties of LLM activations to compute expected attention scores in closed form for each KV pair. These scores enable principled ranking and pruning of KV pairs with minimal impact on the residual stream, achieving effective compression without performance degradation. Importantly, our method operates seamlessly across both prefilling and decoding phases, consistently outperforming state-of-the-art baselines in both scenarios. Finally, we release KVPress, a comprehensive library to enable researchers to implement and benchmark KV cache compression methods, already including more than 20 techniques.
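Abstract (reference translation): Memory consumption of the Key-Value (KV) cache is a major bottleneck for efficient large language model inference. Attention-score-based KV cache pruning is promising, but it faces important practical limitations: attention scores from future tokens are unavailable at compression time. To overcome these challenges, we introduce Expected Attention, a training-free compression method that estimates the importance of KV pairs by predicting how future queries will attend to them. The proposed method uses the distributional properties of LLM activations to compute expected attention scores in closed form for each KV pair. These scores enable principled ranking and pruning of KV pairs with minimal impact on the residual stream, achieving effective compression without performance degradation. Importantly, the method operates seamlessly across both the prefilling and decoding phases, consistently outperforming state-of-the-art baselines in both scenarios. Finally, we release KVPress, a comprehensive library that enables researchers to implement and benchmark KV cache compression methods.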
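
Illustrative sketch (not from the paper or from KVPress): the abstract says expected attention scores are computed in closed form from the distributional properties of LLM activations, but it does not state the formula. The NumPy sketch below assumes, purely for illustration, that future queries follow a Gaussian with mean q_mean and covariance q_cov. Under that assumption the Gaussian moment-generating function gives E[exp(q·k/√d)] = exp(μ·k/√d + kᵀΣk/(2d)). The function names, the normalization over keys, and the top-k pruning rule are all hypothetical choices, and normalizing the expected unnormalized weights is only an approximation of the expected softmax.

```python
import numpy as np


def expected_attention_scores(keys, q_mean, q_cov):
    """Closed-form expected (unnormalized) attention per key, then normalized.

    Assumption: future queries q ~ N(q_mean, q_cov). This is an illustrative
    modelling choice, not something stated in the abstract.
    keys: (n, d), q_mean: (d,), q_cov: (d, d). Returns an (n,) array summing to 1.
    """
    d = keys.shape[1]
    # Mean of the logit q.k / sqrt(d) for each key.
    mean_term = keys @ q_mean / np.sqrt(d)
    # Half the variance of the logit: k^T Sigma k / (2 d).
    var_term = np.einsum("nd,de,ne->n", keys, q_cov, keys) / (2.0 * d)
    log_scores = mean_term + var_term
    log_scores = log_scores - log_scores.max()  # numerical stability
    scores = np.exp(log_scores)
    return scores / scores.sum()


def prune_kv(keys, values, scores, compression_ratio):
    """Keep the highest-scoring KV pairs, preserving their original order.

    compression_ratio is the fraction of pairs to drop (hypothetical parameter).
    """
    n = keys.shape[0]
    n_keep = max(1, int(round(n * (1.0 - compression_ratio))))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return keys[keep], values[keep], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    keys = rng.normal(size=(n, d))
    values = rng.normal(size=(n, d))
    # Estimate query statistics from previously seen queries (toy data here).
    past_queries = rng.normal(size=(64, d))
    q_mean = past_queries.mean(axis=0)
    q_cov = np.cov(past_queries, rowvar=False)
    scores = expected_attention_scores(keys, q_mean, q_cov)
    small_keys, small_values, kept = prune_kv(keys, values, scores, 0.5)
    print(small_keys.shape, kept)
```

Because the scores depend only on the keys and on query statistics, this kind of scoring needs neither future tokens nor a materialized attention matrix, which is the practical limitation the abstract highlights.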

Related papers