Fugu-MT Paper Translation (Abstract): Hita: Holistic Tokenizer for Autoregressive Image Generation

Paper summary: Hita: Holistic Tokenizer for Autoregressive Image Generation

arxiv url: http://arxiv.org/abs/2507.02358v3
Date: Tue, 08 Jul 2025 13:43:13 GMT
Status: Translation complete
Last updated in system: 2025-07-09 12:20:17.777289
Title: Hita: Holistic Tokenizer for Autoregressive Image Generation
Authors: Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, Xiaojuan Qi
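Abstract summary: The paper introduces Hita, a novel image tokenizer for autoregressive (AR) image generation. It uses a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens.
Reference score (independently computed attention measure): 56.81871174745175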
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vanilla autoregressive image generation models generate visual tokens step-by-step, limiting their ability to capture holistic relationships among token sequences. Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Hita incorporates two key strategies to better align with the AR generation process: 1) arranging a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, and using causal attention to maintain awareness of previous tokens; and 2) adopting a lightweight fusion module before feeding the de-quantized tokens into the decoder to control information flow and prioritize holistic tokens. Extensive experiments show that Hita accelerates the training of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. Detailed analysis of the holistic representation highlights its ability to capture global image properties, such as textures, materials, and shapes. Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita.
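The first strategy in the abstract (holistic tokens placed first, patch-level tokens after, with causal attention over the combined sequence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_sequence` and `causal_mask`, the token ids, and the token counts are all hypothetical and chosen only for illustration.

```python
from typing import List


def build_sequence(holistic: List[int], patches: List[int]) -> List[int]:
    """Concatenate tokens with holistic tokens first, then patch-level tokens."""
    return list(holistic) + list(patches)


def causal_mask(length: int) -> List[List[bool]]:
    """Return a mask where mask[i][j] is True if position i may attend to j.

    Under causal attention, position i sees itself and all earlier positions.
    """
    return [[j <= i for j in range(length)] for i in range(length)]


# Hypothetical token ids; the sizes are illustrative, not taken from the paper.
holistic_tokens = [101, 102]      # 2 holistic tokens
patch_tokens = [7, 8, 9, 10]      # 4 patch-level tokens

sequence = build_sequence(holistic_tokens, patch_tokens)
mask = causal_mask(len(sequence))

# Row of the mask for the first patch token (index 2 in the sequence).
first_patch_row = mask[len(holistic_tokens)]
```

Because the holistic tokens occupy the earliest positions, every patch-level position can attend to all of them under this mask, while no holistic token can attend to a later patch token.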


Related papers