Fugu-MT Paper Translation (Abstract): MacTok: Robust Continuous Tokenization for Image Generation

Paper abstract: MacTok: Robust Continuous Tokenization for Image Generation

arxiv url: http://arxiv.org/abs/2603.29634v1
Date: Tue, 31 Mar 2026 12:00:11 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:03.594568
Title: MacTok: Robust Continuous Tokenization for Image Generation
Title (reference translation): MacTok: Robust Continuous Tokenization for Image Generation
Authors: Hengyu Zeng, Xin Gao, Guanghao Li, Yuxiang Yan, Jiaoyang Ruan, Junpeng Ma, Haoyu Albert Wang, Jian Pu
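Abstract summary: MacTok is a 1D continuous tokenizer that learns compact and robust representations. MacTok applies both random masking and DINO-guided semantic masking, the latter to emphasize informative regions in images. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256$\times$256 and a state-of-the-art 1.52 at 512$\times$512 with SiT-XL.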
Reference score (independently computed attention score): 19.46209544955821
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce \textbf{MacTok}, a \textbf{M}asked \textbf{A}ugmenting 1D \textbf{C}ontinuous \textbf{Tok}enizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256$\times$256 and a state-of-the-art 1.52 at 512$\times$512 with SiT-XL, while reducing token usage by up to 64$\times$. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.
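Abstract (reference translation): Continuous image tokenizers enable efficient visual generation, and those built on variational frameworks can learn smooth, structured latent representations through KL regularization. With fewer tokens, however, this often causes posterior collapse, in which the encoder fails to encode informative features into the compressed latent space. To address this, the authors introduce \textbf{MacTok}, a \textbf{M}asked \textbf{A}ugmenting 1D \textbf{C}ontinuous \textbf{Tok}enizer that uses image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space that requires only 64 or 128 tokens. On ImageNet, MacTok reaches a competitive gFID of 1.44 at 256$\times$256 and a state-of-the-art 1.52 at 512$\times$512 with SiT-XL, while reducing token usage by up to 64$\times$. These results confirm that masking and semantic guidance together prevent posterior collapse and yield efficient, high-fidelity tokenization.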
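
The two masking strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_mask` and `semantic_mask`, the mask ratios, and the per-patch `saliency` array are all hypothetical, and the scores here are plain numbers standing in for whatever DINO-derived signal the paper uses. The abstract also does not say whether the semantic mask hides or keeps the high-scoring patches; the sketch assumes it hides them.

```python
import numpy as np


def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask (True = hidden) over patches, chosen uniformly at random."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    hidden = rng.choice(num_patches, size=num_masked, replace=False)
    mask[hidden] = True
    return mask


def semantic_mask(saliency, mask_ratio):
    """Boolean mask (True = hidden) over the highest-scoring patches.

    Assumption: `saliency` is a 1D array of per-patch scores; hiding the
    top-scoring patches is one possible reading of "semantic masking".
    """
    num_masked = int(round(saliency.size * mask_ratio))
    mask = np.zeros(saliency.size, dtype=bool)
    if num_masked > 0:  # guard: a slice of [-0:] would select everything
        top = np.argsort(saliency)[-num_masked:]
        mask[top] = True
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_patches = 256                    # e.g. a 16 x 16 patch grid (assumed)
    saliency = rng.random(num_patches)   # stand-in scores, not DINO output
    r = random_mask(num_patches, 0.5, rng)
    s = semantic_mask(saliency, 0.25)
    print(r.sum(), s.sum())
```

In a training loop, such a mask would typically be used to replace the hidden patch embeddings with a mask token before encoding, so the tokenizer has to reconstruct the image from incomplete evidence.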

