Fugu-MT paper translation (abstract): FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

Paper overview: FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

arxiv url: http://arxiv.org/abs/2605.09430v1
Date: Sun, 10 May 2026 09:07:20 GMT
Title: FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
Authors: Junkang Zhou, Yefei He, Feng Chen, Weijie Wang, Bohan Zhuang
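Abstract summary: We introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained autoregressive model into a highly parallel generator. FlashAR achieves up to a 22.9x speedup for 512x512 image generation.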
Reference score (independently computed attention score): 35.20176824483236
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies strictly on next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective in order to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the bias toward horizontal prediction inherent in the final layer. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model, and is then jointly fine-tuned with the backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through lightweight post-training with merely 0.05% of the original training data.
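
The abstract does not specify the functional form of the fusion gate, so the following is only a minimal sketch of one plausible reading: a per-position scalar gate, squashed by a sigmoid, that forms a convex combination of the horizontal and vertical next-token distributions. The names `fuse_predictions` and `gate_logit`, the toy three-token vocabulary, and the choice to mix probabilities rather than logits are all assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_predictions(horizontal_logits, vertical_logits, gate_logit):
    """Blend two next-token distributions for a single target position.

    gate_logit is a hypothetical per-position scalar. In a trained model it
    would come from a learned layer; here it is passed in directly.
    """
    g = sigmoid(gate_logit)           # weight on the horizontal head, in (0, 1)
    p_h = softmax(horizontal_logits)  # row-wise (horizontal) prediction
    p_v = softmax(vertical_logits)    # column-wise (vertical) prediction
    return [g * a + (1.0 - g) * b for a, b in zip(p_h, p_v)]


# Toy vocabulary of three tokens: the two heads disagree on the best token.
h_logits = [2.0, 0.0, 0.0]
v_logits = [0.0, 2.0, 0.0]

fused_even = fuse_predictions(h_logits, v_logits, 0.0)    # gate = 0.5
fused_horiz = fuse_predictions(h_logits, v_logits, 10.0)  # gate close to 1
```

With a gate of 0.5 the two heads contribute equally, so tokens 0 and 1 receive the same probability. Pushing the gate toward 1 recovers the horizontal head's distribution almost exactly, which is the behaviour one would want at positions where row-wise context dominates.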
