Fugu-MT Paper Translation (Abstract): Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

Paper overview: Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

arxiv url: http://arxiv.org/abs/2605.20600v1
Date: Wed, 20 May 2026 01:30:33 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.426069
Title: Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation
Title (reference translation): Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation
Authors: Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Yunming Ye
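Abstract summary: Autoregressive (AR) visual generation achieves remarkable performance but suffers from high memory usage and low throughput. Recent work shows that retaining only a few lines of cache tokens can maintain high image quality while significantly reducing memory usage and improving throughput. This paper proposes HeadKV, a new key-value (KV) cache compression framework for autoregressive image generation.
Reference score (independently computed attention metric): 27.042998548651358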
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, i.e., a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.
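Abstract (reference translation): Autoregressive (AR) visual generation achieves strong performance but suffers from high memory usage and low throughput because previously generated visual tokens must be cached. Recent work shows that retaining only a few lines of cache tokens can maintain high image quality while significantly reducing memory usage and improving throughput. However, these methods assign a fixed budget to every attention head, overlooking the heterogeneity among heads and leading to suboptimal memory allocation. This paper observes that attention heads across different layers exhibit diverse attention patterns. Based on this insight, it proposes HeadKV, a new head-aware key-value (KV) cache compression framework for autoregressive image generation that assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge is identifying the type of each attention head in order to guide cache compression. Furthermore, within the same layer, each head exhibits consistent attention patterns across token positions. This suggests that head types can be identified in the early stage and reused for KV compression throughout generation. The advantage is that no additional training or dataset-level statistics are required, and the approach generalizes seamlessly across different inputs. In addition, a Stratified Token Eviction strategy is designed to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.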
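
The head-aware budgeting idea described in the abstract can be illustrated with a small sketch. The abstract does not give the paper's classification rule or allocation formula, so everything here is an assumption made for illustration: the locality score (share of attention mass on the most recent keys), the proportional split, and the names `locality_score`, `allocate_budgets`, `window`, and `min_budget` are all hypothetical and are not HeadKV's actual method.

```python
def locality_score(attn_row, window):
    """Fraction of one query's attention mass on the `window` most recent keys.

    Hypothetical measure; rows should be longer than `window` to be informative.
    """
    total = sum(attn_row)
    return sum(attn_row[-window:]) / total if total > 0 else 0.0


def allocate_budgets(early_attn, window, total_budget, min_budget=1):
    """Split `total_budget` cached tokens across heads (illustrative only).

    early_attn[h] is a list of attention rows, one per early query token,
    for head h. Heads with broader attention receive a larger share.
    """
    num_heads = len(early_attn)
    spare = total_budget - min_budget * num_heads
    if spare < 0:
        raise ValueError("total_budget is too small for min_budget per head")

    # Average locality of each head over the early query tokens.
    locality = [
        sum(locality_score(row, window) for row in rows) / len(rows)
        for rows in early_attn
    ]
    # Assumed weighting: the less local a head is, the larger its share.
    breadth = [1.0 - s for s in locality]
    norm = sum(breadth)
    if norm == 0:
        # Every head is fully local: fall back to an even split.
        return [total_budget // num_heads] * num_heads
    # Flooring keeps the sum of budgets at or below total_budget.
    return [min_budget + int(spare * b / norm) for b in breadth]


# Head 0 attends only to the two most recent keys; head 1 attends uniformly.
early_attn = [
    [[0.0, 0.0, 0.1, 0.9]],
    [[0.25, 0.25, 0.25, 0.25]],
]
budgets = allocate_budgets(early_attn, window=2, total_budget=10, min_budget=2)
print(budgets)  # -> [2, 8]
```

Because the scores are computed once from early query tokens and then reused, the sketch mirrors the abstract's claim that no extra training or dataset-level statistics are needed; how the retained tokens are then chosen (the Stratified Token Eviction step) is not covered here.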

Related papers