Fugu-MT Paper Translation (Abstract): Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression

Paper overview: Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression

arxiv url: http://arxiv.org/abs/2603.20616v1
Date: Sat, 21 Mar 2026 03:21:43 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:38.996726
Title: Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression
Authors: Ruijie Miao, Zhiming Wang, Wang Li, Shiwei Wu, Shufan Liu, Yanbing Jiang, Tong Yang
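Abstract summary: Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a finer granularity. Our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
Reference score (independently computed attention score): 16.495900730787955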
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves comparable performance to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
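The two ideas in the abstract, linear growth of KV cache memory and token eviction as an all-or-nothing dimension allocation, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per element), the function names, and above all `mixed_allocation`, which is a toy proportional rule and not the MixedDimKV algorithm described in the paper.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Size of an uncompressed KV cache in bytes.

    The factor 2 covers keys and values. The default model shape is an
    assumed example, not a configuration taken from the paper.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def eviction_allocation(scores, budget_dims, head_dim):
    """Token eviction viewed as dimension allocation: each token gets
    either the full head_dim or zero, highest scores first."""
    keep = min(len(scores), budget_dims // head_dim)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dims = [0] * len(scores)
    for i in ranked[:keep]:
        dims[i] = head_dim
    return dims


def mixed_allocation(scores, budget_dims, head_dim, step=2):
    """Hypothetical mixed-dimension rule (NOT the paper's method): give each
    token a share of the budget proportional to its score, rounded down to a
    multiple of `step` and capped at head_dim. Assumes non-negative scores."""
    total = sum(scores)
    if total <= 0:
        return [0] * len(scores)
    dims = []
    for s in scores:
        share = budget_dims * s / total
        dims.append(min(head_dim, int(share // step) * step))
    return dims


# Memory grows linearly with the number of cached tokens.
full = kv_cache_bytes(50_000)
print("full cache bytes at 50K tokens:", full)
print("bytes at a 6.25% budget:", int(full * 0.0625))

# Same budget (16 dims across 4 tokens with head_dim 8), two allocation styles.
scores = [4, 3, 2, 1]
print("eviction:", eviction_allocation(scores, budget_dims=16, head_dim=8))
print("mixed:   ", mixed_allocation(scores, budget_dims=16, head_dim=8))
```

Under the same budget, the eviction rule keeps two tokens at full dimension and drops the rest, while the toy mixed rule keeps three tokens at reduced dimension. How the paper actually scores tokens and projects them to lower dimensions is not stated in the abstract, so the sketch makes no attempt to reproduce it.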


Related papers