Fugu-MT Paper Translation (Abstract): Vision Foundation Models as Generalist Tokenizers for Image Generation

Paper overview: Vision Foundation Models as Generalist Tokenizers for Image Generation

arxiv url: http://arxiv.org/abs/2605.18390v1
Date: Mon, 18 May 2026 13:38:43 GMT
Status: Translation complete
Last updated in system: 2026-05-19 23:51:08.410207
Title: Vision Foundation Models as Generalist Tokenizers for Image Generation
Authors: Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
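Abstract summary: We build a generalist image tokenizer on top of a frozen vision foundation model (VFM). We propose VFMTok, a tokenizer that operates seamlessly in both discrete and continuous latent spaces. We find that the self-supervised learning objectives used during VFM pre-training determine its effectiveness as a tokenizer.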
Reference score (independently computed attention score): 43.17659097958283
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we investigate a largely unexplored direction: building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). The tokenizer uses a frozen VFM as its encoder and introduces two key innovations: (1) a region-adaptive quantization framework that eliminates spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Building on these designs, we propose VFMTok, a generalist visual tokenizer that operates seamlessly in both discrete and continuous latent spaces. VFMTok substantially improves synthesis quality while greatly improving token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID of 1.36 on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields a gFID of 1.25. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (w/o CFG) in both generative paradigms, which significantly speeds up inference. Beyond these empirical results, we systematically investigate the mechanisms underlying our approach. We find that the self-supervised learning objectives used during VFM pre-training determine its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the best representations for image tokenization. These insights establish a strong foundation and offer guidance for the design of future image tokenizers.
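To make the two ingredients named in the abstract more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. It assumes plain nearest-neighbour vector quantization (not the paper's region-adaptive framework) and assumes the semantic reconstruction objective can be approximated by a mean (1 - cosine similarity) between decoded features and frozen-VFM features. All function names, the codebook, and the feature values are hypothetical.

```python
import math


def quantize(feature, codebook):
    """Return (index, code) of the codebook entry nearest to `feature`.

    Assumption: plain nearest-neighbour lookup under squared L2 distance,
    standing in for the paper's region-adaptive quantization.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))
    return index, codebook[index]


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_reconstruction_loss(decoded_feats, vfm_feats):
    """Mean (1 - cosine similarity) between decoded and frozen-VFM features.

    Assumption: the abstract only says decoded outputs are aligned with the
    VFM's representations; the cosine form used here is illustrative.
    """
    losses = [1.0 - cosine_similarity(d, v) for d, v in zip(decoded_feats, vfm_feats)]
    return sum(losses) / len(losses)


# Hypothetical toy data: a 3-entry codebook and two 2-D "VFM features".
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
vfm_feats = [[0.9, 0.1], [0.2, 0.8]]

# Discrete tokens are the indices of the nearest codebook entries.
tokens = [quantize(f, codebook)[0] for f in vfm_feats]

# A trivial "decoder" that just looks the codes back up.
decoded = [codebook[t] for t in tokens]
loss = semantic_reconstruction_loss(decoded, vfm_feats)
```

In this toy setup the loss is small when the quantized codes point in nearly the same direction as the frozen features, which is the intuition behind aligning decoded outputs with the VFM's representations.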


Related papers