Fugu-MT Paper Translation (Abstract): Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

Paper abstract: Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

arxiv url: http://arxiv.org/abs/2512.04857v1
Date: Thu, 04 Dec 2025 14:41:21 GMT
Status: Translation complete
System update date: 2025-12-05 21:11:46.224268
Title: Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens
Authors: Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan, Tieyuan Chen, Weiyao Lin
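Abstract summary: LineAR is a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. It manages the cache at the line level using a 2D view, preserving visual dependency regions while evicting less-informative tokens. LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup.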
Reference score (independently computed attention score): 33.3294598877681
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient autoregressive (AR) image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
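
Illustrative sketch (not from the paper): the abstract describes keeping the cache at the line level and evicting less-informative older tokens under the guidance of inter-line attention, but it does not give the scoring rule, the window size, or the budget policy. The Python below is therefore only a toy under stated assumptions: tokens are indexed in raster order, the most recent `recent_lines` rows are always kept, and older tokens are ranked by a hypothetical per-token attention score. The function name `compress_cache_after_line` and all of its parameters are invented for illustration.

```python
from typing import Dict, List


def compress_cache_after_line(
    cached_positions: List[int],
    inter_line_attention: Dict[int, float],
    line_width: int,
    current_line: int,
    recent_lines: int,
    budget: int,
) -> List[int]:
    """Return the token positions to keep once `current_line` is generated.

    Assumption (not from the paper): every token in the last `recent_lines`
    rows is kept, and older tokens compete for the remaining budget based on
    the attention they received from the newest line.
    """
    first_recent_row = current_line - recent_lines + 1

    # Split the cache into the protected recent rows and the older rows.
    recent = [p for p in cached_positions if p // line_width >= first_recent_row]
    older = [p for p in cached_positions if p // line_width < first_recent_row]

    # Slots left for older tokens after the recent rows are accounted for.
    room = max(budget - len(recent), 0)

    # Highest attention first; ties go to the more recently generated token.
    ranked = sorted(older, key=lambda p: (-inter_line_attention.get(p, 0.0), -p))
    return sorted(ranked[:room] + recent)


# Toy usage: a 4-token-wide image, rows 0..2 generated (positions 0..11).
scores = {1: 0.5, 6: 0.9, 3: 0.1}  # hypothetical attention from row 2's queries
kept = compress_cache_after_line(
    cached_positions=list(range(12)),
    inter_line_attention=scores,
    line_width=4,
    current_line=2,
    recent_lines=1,
    budget=6,
)
print(kept)  # [1, 6, 8, 9, 10, 11]
```

Calling such a function after every generated line would bound the cache at `budget` entries, which is the sense in which the compression is "progressive"; the actual LineAR criteria may differ substantially.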
