Fugu-MT Paper Translation (Summary): HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

Paper summary: HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

arxiv url: http://arxiv.org/abs/2606.08302v1
Date: Sat, 06 Jun 2026 18:58:26 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.02908
Title: HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling
Authors: Ziran Qin, Yuchen Jiang, Mingbao Lin, Youru Lv, Hang Guo, Wen Fei, Weiyao Lin
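Abstract summary: HACK++ is a training-free Head-Aware key-value Compression frameworK for Visual Autoregressive (VAR) models. It decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.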
Reference score (independently computed attention metric): 42.1403262611533
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale paradigm. We begin with an in-depth analysis of VAR attention and observe that attention heads can be stably divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads preserve spatial coherence. Their functional divergence makes existing one-size-fits-all compression methods perform poorly on VAR models. We further find that the two head types differ markedly in their reliance on historical scales, and that this reliance shifts across layers and generation steps, arguing for an adaptive cache budget allocation. To address these challenges, we propose HACK++, a training-free Head-Aware key-value Compression frameworK for VAR models. From a one-time offline calibration, HACK++ classifies head types and derives head-specific priors. At inference, it decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively, via pattern-specific strategies and a reliance-aware budget allocation. Extensive experiments on multiple VAR models across text-to-image, class-conditional, and unified understanding-and-generation tasks validate the effectiveness and generalizability of HACK++. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.
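The abstract describes compressing the accumulated KV cache under a cache budget, with the budget allocated across heads according to their reliance on historical scales. It does not give the scoring rule or the allocation formula, so the following is only a minimal sketch under assumed choices, not the authors' method: the function names `allocate_budgets` and `compress_cache`, the proportional allocation, the per-entry importance `scores`, and the `reliance` values are all hypothetical.

```python
import numpy as np

def allocate_budgets(reliance, total_keep, cache_len):
    """Split total_keep cached entries across heads in proportion to reliance.

    Assumption: proportional allocation; the source does not state the formula.
    """
    reliance = np.asarray(reliance, dtype=float)
    weights = reliance / reliance.sum()
    keep = np.floor(weights * total_keep).astype(int)
    # Every head keeps at least one entry and at most the full cache.
    # (The lower clip can push the sum slightly above total_keep in edge cases.)
    return np.clip(keep, 1, cache_len)

def compress_cache(keys, values, scores, keep_per_head):
    """Keep the highest-scoring cached entries for each head.

    keys, values: arrays of shape (heads, cache_len, dim)
    scores: array of shape (heads, cache_len); higher means more important
    keep_per_head: integer array of shape (heads,)
    """
    compressed = []
    for h, k in enumerate(keep_per_head):
        # Indices of the k largest scores, restored to original token order.
        top = np.sort(np.argsort(scores[h])[-k:])
        compressed.append((keys[h, top], values[h, top]))
    return compressed

# Toy data standing in for a cache accumulated over earlier scales.
rng = np.random.default_rng(0)
heads, cache_len, dim = 4, 100, 8
keys = rng.normal(size=(heads, cache_len, dim))
values = rng.normal(size=(heads, cache_len, dim))
scores = rng.random(size=(heads, cache_len))  # hypothetical importance scores

cache_budget = 0.10  # the 10% cache budget quoted in the abstract
total_keep = round(cache_budget * heads * cache_len)
reliance = [4.0, 3.0, 2.0, 1.0]  # hypothetical per-head reliance on history
keep_per_head = allocate_budgets(reliance, total_keep, cache_len)
compressed = compress_cache(keys, values, scores, keep_per_head)

for h, (k, v) in enumerate(compressed):
    print(f"head {h}: kept {k.shape[0]} of {cache_len} cached entries")
```

In this sketch, heads with higher assumed reliance retain more cached entries while the total stays near the overall budget. The attention budget mentioned in the abstract is a separate, independent limit and is not modeled here.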


Related papers