Fugu-MT Paper Translation (Abstract): ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

Paper abstract: ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

arxiv url: http://arxiv.org/abs/2504.01934v2
Date: Thu, 03 Apr 2025 16:43:14 GMT
Status: Translation complete
Last updated in system: 2025-04-04 12:51:12.749455
Title: ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
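Title (reference translation): ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
Authors: Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, Hang Xu
Abstract summary: Existing unified models struggle to handle three fundamental capabilities (understanding, generation, and editing) within a single model. ILLUME+ introduces DualViTok, a unified dual visual tokenizer that preserves both fine-grained textures and text-aligned semantics. It also uses a diffusion model as the image detokenizer to improve generation quality and provide efficient super-resolution.
Reference score (independently computed attention score): 68.05833403672274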
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present ILLUME+, which leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to handle three fundamental capabilities simultaneously: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization; due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, the Janus series decouples the input and output image representations, limiting its ability to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: https://illume-unified-mllm.github.io/.
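Abstract (reference translation): This paper presents ILLUME+, which uses dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models struggle to handle three fundamental capabilities at once: understanding, generation, and editing. Models such as Chameleon and EMU3 use VQGAN for image discretization and, lacking deep semantic interaction, lag behind specialist models such as LLaVA on visual understanding tasks. To mitigate this, LaViT and ILLUME use semantic encoders for tokenization, but they struggle with image editing because textures are poorly preserved. The Janus series, meanwhile, decouples the input and output image representations, which limits its ability to handle interleaved image-text understanding and generation seamlessly. In contrast, ILLUME+ introduces DualViTok, a unified dual visual tokenizer that preserves both fine-grained textures and text-aligned semantics and enables a coarse-to-fine image representation strategy for multimodal understanding and generation. In addition, a diffusion model serves as the image detokenizer, improving generation quality and providing efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, the MLLM, and the diffusion decoder. This design enables flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) shows competitive performance against existing unified MLLMs and specialized models on multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project page: https://illume-unified-mllm.github.io/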
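
The dual-tokenization idea in the abstract can be illustrated with a toy sketch. This is not the DualViTok implementation: the function names, the nearest-neighbor quantizer, the tiny hand-written codebooks, and the semantic-first ordering of the output sequence are all assumptions made for illustration. It only shows how two feature streams (one semantic, one pixel-level) might each be mapped to discrete tokens and then concatenated from coarse to fine.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map every feature vector to the index of its nearest codebook entry."""
    return [nearest_code(f, codebook) for f in features]


def dual_tokenize(
    semantic_feats: Sequence[Vector],
    pixel_feats: Sequence[Vector],
    semantic_codebook: Sequence[Vector],
    pixel_codebook: Sequence[Vector],
) -> List[Tuple[str, int]]:
    """Quantize both branches and emit semantic tokens before pixel tokens.

    The semantic-first ordering is an assumption standing in for the
    coarse-to-fine strategy mentioned in the abstract.
    """
    semantic_ids = quantize(semantic_feats, semantic_codebook)
    pixel_ids = quantize(pixel_feats, pixel_codebook)
    coarse = [("semantic", i) for i in semantic_ids]
    fine = [("pixel", i) for i in pixel_ids]
    return coarse + fine


# Hypothetical toy data: two 2-D semantic features and two 1-D pixel features.
semantic_codebook = [(0.0, 0.0), (1.0, 1.0)]
pixel_codebook = [(0.0,), (0.5,), (1.0,)]
tokens = dual_tokenize(
    semantic_feats=[(0.9, 1.2), (0.1, -0.1)],
    pixel_feats=[(0.45,), (0.9,)],
    semantic_codebook=semantic_codebook,
    pixel_codebook=pixel_codebook,
)
print(tokens)
# [('semantic', 1), ('semantic', 0), ('pixel', 1), ('pixel', 2)]
```

Tagging each token with its branch keeps the two codebooks' index spaces apart. A real system would more likely use disjoint vocabulary ranges, but that detail is not given in the abstract.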

Related papers