Fugu-MT Paper Translation (Abstract): Autoregressive Visual Generation Needs a Prologue

Paper summary: Autoregressive Visual Generation Needs a Prologue

arxiv url: http://arxiv.org/abs/2605.06137v1
Date: Thu, 07 May 2026 12:35:51 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.776664
Title: Autoregressive Visual Generation Needs a Prologue
Title (reference translation): Autoregressive Visual Generation Needs a Prologue
Authors: Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu,
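Abstract summary: Prologue is an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. It generates a small set of prologue tokens prepended to the visual token sequence. The proposed method improves generation quality by introducing a separate learned generative representation.
Reference score (independently computed attention score): 21.427403915969872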
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.
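Abstract (reference translation): This paper proposes Prologue, a method for bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained only with the AR cross-entropy (CE) loss, while the visual tokens remain dedicated to reconstruction. This decoupled design makes it possible to optimize generation through the AR model's true distribution without affecting reconstruction quality, and it is further formalized from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged, and Prologue-Large reaches a competitive rFID of 0.99 and a gFID of 1.46 using a standard AR model. Interestingly, driven only by AR gradients, the prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% obtained from the first 16 tokens of a standard tokenizer. Generation quality can be improved by introducing a separately learned generative representation while leaving the original representation intact.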
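The sequence layout described in the abstract (a few prologue tokens placed in front of the visual tokens, with a next-token cross-entropy loss over the combined sequence) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the vocabulary size, the visual token count, the function names, and the random logits standing in for an AR model are all assumptions made for illustration. Only the figure of 16 prologue tokens comes from the abstract's probing experiment.

```python
import numpy as np


def build_sequence(prologue_ids, visual_ids):
    """Prepend prologue tokens to the visual token sequence."""
    return np.concatenate([prologue_ids, visual_ids])


def ar_cross_entropy(logits, targets):
    """Mean next-token cross-entropy.

    logits:  (T, V) unnormalised scores; row t scores the token targets[t].
    targets: (T,) integer token ids.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked.mean())


rng = np.random.default_rng(0)
VOCAB = 32          # hypothetical shared vocabulary size
NUM_PROLOGUE = 16   # matches the 16 prologue tokens mentioned in the abstract
NUM_VISUAL = 256    # hypothetical number of visual tokens

prologue = rng.integers(0, VOCAB, size=NUM_PROLOGUE)
visual = rng.integers(0, VOCAB, size=NUM_VISUAL)
seq = build_sequence(prologue, visual)

# Stand-in for an AR model: random logits, one row per predicted position.
# Position t is predicted from seq[:t], so the targets are seq[1:].
logits = rng.normal(size=(len(seq) - 1, VOCAB))
loss = ar_cross_entropy(logits, seq[1:])
```

In this layout every visual token is predicted with the prologue tokens already in context, which is the property the abstract relies on. How gradients are routed so that only the prologue tokens are shaped by the AR loss is not shown here, since the abstract does not give those details.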
