Fugu-MT Paper Translation (Abstract): HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

Paper summary: HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

arxiv url: http://arxiv.org/abs/2605.14877v1
Date: Thu, 14 May 2026 14:22:34 GMT
Status: Translation complete
System update date: 2026-05-15 21:45:34.87008
Title: HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling
Title (reference translation): HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling
Authors: Jonathan Cederlund, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson
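Abstract summary: HeatKV is a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. HeatKV achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.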
Reference score (independently computed attention score): 2.8560048042907744
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves $2 \times$ higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.
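Abstract (reference translation): Visual Autoregressive (VAR) models have recently shown impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. This work introduces HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, a static pruning schedule is constructed for a given memory budget. Applied to the Infinity-2B model, HeatKV achieves a $2 \times$ higher compression ratio in KV-cache memory allocation than existing methods. The proposed method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression.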
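
The abstract describes two steps: ranking attention heads by their attention to prior scales on a calibration set, then building a static pruning schedule for a memory budget. The following is a minimal sketch of that idea only, not the authors' implementation. The function names (`rank_heads`, `build_static_schedule`), the input layout, and the greedy top-up rule are all hypothetical assumptions; the abstract does not specify how the ranking is turned into per-head cache sizes.

```python
from typing import List


def rank_heads(prior_scale_attention: List[List[float]]) -> List[int]:
    """Rank heads by mean attention to prior scales over a calibration set.

    prior_scale_attention[s][h] is assumed to hold the attention score that
    head h places on previously generated scales for calibration sample s.
    Returns head indices, highest mean score first.
    """
    num_samples = len(prior_scale_attention)
    num_heads = len(prior_scale_attention[0])
    means = [
        sum(sample[h] for sample in prior_scale_attention) / num_samples
        for h in range(num_heads)
    ]
    return sorted(range(num_heads), key=lambda h: means[h], reverse=True)


def build_static_schedule(
    ranking: List[int], full_tokens: int, min_tokens: int, budget_tokens: int
) -> List[int]:
    """Assign a fixed cache size (in tokens) to every head.

    Assumed rule: each head gets min_tokens, then the remaining budget is
    handed out in ranking order, topping each head up to full_tokens.
    """
    num_heads = len(ranking)
    if budget_tokens < num_heads * min_tokens:
        raise ValueError("budget is too small for the minimum allocation")
    allocation = [min_tokens] * num_heads
    remaining = budget_tokens - num_heads * min_tokens
    for head in ranking:
        top_up = min(full_tokens - min_tokens, remaining)
        allocation[head] += top_up
        remaining -= top_up
        if remaining == 0:
            break
    return allocation


# Toy calibration data: 2 samples, 3 heads (illustrative numbers only).
calibration = [[0.9, 0.1, 0.5], [0.7, 0.3, 0.5]]
ranking = rank_heads(calibration)
schedule = build_static_schedule(
    ranking, full_tokens=100, min_tokens=10, budget_tokens=150
)
print(ranking, schedule)
```

Because the schedule is computed once offline, it can be applied at generation time without inspecting attention scores again, which is what makes it static.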


Related papers