Fugu-MT Paper Translation (Summary): EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

Paper summary: EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

arxiv url: http://arxiv.org/abs/2603.12108v1
Date: Thu, 12 Mar 2026 16:13:43 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.204467
Title: EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation
Title (reference translation): EvoTok: A Unified Image Tokenizer via Residual Evolution for Visual Understanding and Generation
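Authors: Yan Li, Ning Liao, Xiangyu Zhao, Shaofeng Zhang, Xiaoxing Wang, Yifan Yang, Junchi Yan, Xue Yang
Abstract summary: Understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. EvoTok is a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. EvoTok shows promising performance on 7 of 9 visual understanding benchmarks and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench.
Reference score (attention metric computed independently by this site): 68.09145886228585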
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually either enforce both forms of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory in which earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance on 7 of 9 visual understanding benchmarks and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
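Abstract (reference translation): The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation. Existing approaches usually either enforce both forms of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory in which earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance on 7 of 9 visual understanding benchmarks and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. These results indicate that modeling visual representations as an evolving trajectory is an effective and principled solution for unifying visual understanding and generation.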
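
The abstract says that EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. The following is a minimal sketch of generic residual vector quantization on a single latent vector, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the toy sizes, the randomly drawn (not learned) codebooks, and the all-zero entry added to each codebook so that the reconstruction error cannot grow from one stage to the next. It shows only the cascading mechanism; it does not model the transition from low-level detail to high-level semantics that the paper describes.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes what earlier stages left over."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry; the remainder goes to the next stage.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks, depth=None):
    """Sum the selected entries of the first `depth` stages (all stages by default)."""
    depth = len(indices) if depth is None else depth
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices[:depth], codebooks[:depth]):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def sq_error(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical toy setup: 4 stages, 16 entries per stage, 8-dimensional latents.
rng = random.Random(0)
DIM, STAGES, ENTRIES = 8, 4, 16
codebooks = []
for stage in range(STAGES):
    scale = 0.5 ** stage  # assumption: later stages hold smaller corrections
    cb = [[rng.gauss(0, scale) for _ in range(DIM)] for _ in range(ENTRIES)]
    cb[0] = [0.0] * DIM  # assumption: a zero entry lets a stage change nothing
    codebooks.append(cb)

latent = [rng.gauss(0, 1) for _ in range(DIM)]
tokens = rvq_encode(latent, codebooks)  # one token index per stage
errors = [sq_error(latent, rvq_decode(tokens, codebooks, d))
          for d in range(STAGES + 1)]
print("tokens:", tokens)
print("error after 0..4 stages:", [round(e, 4) for e in errors])
```

Decoding with a smaller `depth` uses only the first tokens of the sequence, which is how a cascaded token sequence can be read at different levels of refinement.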


Related papers