Fugu-MT paper translation (abstract): IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

Paper overview: IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

arxiv url: http://arxiv.org/abs/2510.06928v1
Date: Wed, 08 Oct 2025 12:08:21 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.474891
Title: IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction
Authors: Ran Yi, Teng Hu, Zihan Su, Lizhuang Ma
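Abstract summary: IAR2 is an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. We show that IAR2 sets a new state of the art for autoregressive image generation, achieving an FID of 1.50 on ImageNet.
Reference score (independently computed attention score): 77.06211178777939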
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction (first the semantic token, then the detail token) while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism. Extensive experiments demonstrate that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving a FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.
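Two claims in the abstract can be made concrete with a small sketch. The first function shows why pairing two codebooks grows capacity multiplicatively while storage grows additively. The other two show standard classifier-free guidance with a per-token scale. All names here are hypothetical. In particular, `adaptive_scale` is an assumed schedule (scale rising with relevance and with sequence position); the abstract does not give the paper's actual rule, which may differ.

```python
# Illustrative sketch only. Function names and the scale schedule are
# assumptions for exposition, not taken from the IAR2 paper.

def pair_capacity(num_semantic: int, num_detail: int) -> tuple[int, int]:
    """Return (stored codebook entries, distinct (semantic, detail) pairs)."""
    return num_semantic + num_detail, num_semantic * num_detail


def guided_logits(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def adaptive_scale(base_scale: float, relevance: float, step: int, total_steps: int) -> float:
    """Hypothetical per-token guidance scale.

    Assumes relevance lies in [0, 1] and that guidance should strengthen as
    generation progresses. A scale of 1.0 reduces guidance to the plain
    conditional prediction.
    """
    progress = (step + 1) / total_steps
    return 1.0 + (base_scale - 1.0) * relevance * progress


if __name__ == "__main__":
    stored, pairs = pair_capacity(1024, 1024)
    print(f"stored entries: {stored}, representable pairs: {pairs}")

    scale = adaptive_scale(base_scale=3.0, relevance=0.5, step=4, total_steps=10)
    print(guided_logits([2.0, 0.0], [1.0, 0.0], scale))
```

With two codebooks of 1024 entries each, only 2048 vectors are stored, yet 1024 × 1024 distinct token pairs are available; this is one reading of the "linear to polynomial" capacity claim.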

Related papers