Fugu-MT Paper Translation (Abstract): Holistic Tokenizer for Autoregressive Image Generation

Paper summary: Holistic Tokenizer for Autoregressive Image Generation

arxiv url: http://arxiv.org/abs/2507.02358v1
Date: Thu, 03 Jul 2025 06:44:26 GMT
Status: Translation complete
Last updated in system: 2025-07-04 15:37:15.777222
Title: Holistic Tokenizer for Autoregressive Image Generation
Title (reference translation): Holistic Tokenizer for Autoregressive Image Generation
Authors: Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, Xiaojuan Qi
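Abstract summary: We introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. In experiments, Hita accelerates the training of AR generators and outperforms those trained with vanilla tokenizers.
Reference score (independently computed attention score): 56.81871174745175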
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A vanilla autoregressive image generation model produces visual tokens step by step, which limits its ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, so the tokens carry limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It uses a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. In addition, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens, while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module that controls information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita.
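Abstract (reference translation): The vanilla autoregressive image generation model generates visual tokens step by step, which limits its ability to capture holistic relationships among token sequences. In addition, most visual tokenizers map local image patches to latent tokens, which limits the global information they carry. We therefore introduce Hita, a new image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Hita further incorporates two key strategies to improve alignment with the AR generation process: 1) it places holistic tokens at the start of the sequence, followed by patch-level tokens, and uses causal attention to maintain awareness of preceding tokens; 2) before sending the de-quantized tokens to the decoder, Hita employs a lightweight fusion module to control information flow and prioritize holistic tokens. Extensive experiments show that Hita accelerates the training of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita.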
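
The token ordering described in the abstract (holistic tokens first, then patch-level tokens, read under causal attention) can be illustrated with a minimal sketch. This is not the authors' implementation: the token labels, the token counts, and the helper names `build_sequence` and `causal_mask` are hypothetical, and the quantization and fusion steps are omitted entirely. The sketch only shows that, with this ordering, every patch position can attend to all holistic positions, while holistic positions cannot attend to patches.

```python
def build_sequence(holistic_tokens, patch_tokens):
    """Place holistic tokens at the beginning, followed by patch-level tokens."""
    return list(holistic_tokens) + list(patch_tokens)


def causal_mask(length):
    """Return a length x length mask; mask[i][j] is True when position i
    may attend to position j, i.e. when j is at or before i."""
    return [[j <= i for j in range(length)] for i in range(length)]


# Hypothetical labels and counts, chosen only for illustration.
holistic = ["h0", "h1"]
patches = ["p0", "p1", "p2"]

sequence = build_sequence(holistic, patches)
mask = causal_mask(len(sequence))

# Row for the first patch token: it sees both holistic tokens and itself.
first_patch_row = mask[len(holistic)]
```

Because the holistic tokens come first, an autoregressive generator trained on such sequences would predict the global tokens before any patch token, which is the alignment property the abstract attributes to the sequential structure.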
