ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
- URL: http://arxiv.org/abs/2504.01934v2
- Date: Thu, 03 Apr 2025 16:43:14 GMT
- Title: ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
- Authors: Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, Hang Xu
- Abstract summary: Existing unified models have struggled to handle the three fundamental capabilities in a unified model: understanding, generation, and editing. ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics. We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
- Score: 68.05833403672274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ILLUME+, which leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to simultaneously handle the three fundamental capabilities in a unified model: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization; due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, the Janus series decouples the input and output image representations, limiting its ability to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: https://illume-unified-mllm.github.io/.
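The abstract describes the architecture in enough detail to sketch. Below is a minimal, hypothetical PyTorch rendering of a DualViTok-style dual tokenizer, not the paper's implementation: a semantic branch and a pixel branch quantize the image separately, and their token streams are concatenated coarse-to-fine. All module choices, strides, and codebook sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor lookup into a learned codebook."""
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, N, D) -> (B, N)
        b, n, d = feats.shape
        dists = torch.cdist(feats.reshape(b * n, d), self.codebook.weight)
        return dists.argmin(dim=-1).reshape(b, n)

class DualViTokSketch(nn.Module):
    """Hypothetical dual tokenizer: a semantic branch for text-aligned
    meaning and a pixel branch for fine textures (names/sizes invented)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Large-stride conv standing in for a ViT-style semantic encoder.
        self.semantic_enc = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Small-stride conv standing in for a VQGAN-style pixel encoder.
        self.pixel_enc = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.semantic_vq = VectorQuantizer(16384, dim)
        self.pixel_vq = VectorQuantizer(16384, dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:  # (B, 3, H, W)
        sem = self.semantic_enc(image).flatten(2).transpose(1, 2)
        pix = self.pixel_enc(image).flatten(2).transpose(1, 2)
        sem_ids = self.semantic_vq(sem)   # coarse, semantic token ids
        pix_ids = self.pixel_vq(pix)      # fine, texture token ids
        # Coarse-to-fine ordering: semantics first, then pixel detail.
        return torch.cat([sem_ids, pix_ids], dim=1)

tokens = DualViTokSketch()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 1280]) = 256 semantic + 1024 pixel ids
```

Under the continuous-input, discrete-output scheme, the MLLM would consume continuous features on the input side but emit discrete ids like these, which the diffusion detokenizer then decodes back to pixels.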
Related papers
- Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing [7.278180096265984]
Nexus-Gen is a unified model that synergizes the language reasoning capabilities of multimodal large language models with the image synthesis power of diffusion models.
We introduce a prefilled autoregression strategy that prefills the input sequence with position-embedded special tokens instead of continuous embeddings.
arXiv Detail & Related papers (2025-04-30T06:30:48Z)
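A minimal sketch of Nexus-Gen's prefilled autoregression, reconstructed from the abstract alone: rather than feeding each predicted continuous embedding back in step by step (where errors accumulate), the sequence is prefilled with position-embedded placeholder tokens and all image embeddings are predicted from them in one pass. Every name and dimension below is an assumption.

```python
import torch
import torch.nn as nn

class PrefilledImageHead(nn.Module):
    """Hypothetical sketch: predict N image embeddings from N
    position-embedded placeholder tokens, instead of feeding
    continuous predictions back autoregressively."""
    def __init__(self, d_model: int = 512, n_img_tokens: int = 64):
        super().__init__()
        self.placeholder = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_emb = nn.Parameter(torch.randn(1, n_img_tokens, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.n = n_img_tokens

    def forward(self, text_states: torch.Tensor) -> torch.Tensor:  # (B, T, d)
        b = text_states.size(0)
        # Prefill: one shared special token plus a distinct position
        # embedding per slot, standing in for continuous embeddings.
        slots = self.placeholder.expand(b, self.n, -1) + self.pos_emb
        seq = torch.cat([text_states, slots], dim=1)
        out = self.backbone(seq)
        return out[:, -self.n:]  # predicted image embeddings, one pass

emb = PrefilledImageHead()(torch.randn(2, 10, 512))
print(emb.shape)  # torch.Size([2, 64, 512])
```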
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens [66.02261367232256]
Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation.
Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order.
In this paper, we build a proper visual language by reconstructing diffusion timesteps to learn discrete visual tokens.
arXiv Detail & Related papers (2025-04-20T16:14:28Z)
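The abstract gives only the core idea, so the toy below is a loose, speculative sketch of timestep-aligned tokens: one discrete code per noise level, meant to capture what the denoiser recovers at that step. The encoder, timestep grid, and codebook are invented for illustration.

```python
import torch
import torch.nn as nn

class TimestepTokenizer(nn.Module):
    """Hypothetical sketch: one discrete token per diffusion timestep,
    so the token sequence follows the coarse-to-fine denoising order."""
    def __init__(self, timesteps=(999, 750, 500, 250, 0),
                 dim: int = 128, codebook: int = 1024):
        super().__init__()
        self.timesteps = timesteps
        self.t_emb = nn.Embedding(1000, dim)   # timestep conditioning
        self.encoder = nn.Linear(dim, dim)     # stand-in image encoder
        self.codebook = nn.Embedding(codebook, dim)

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:  # (B, dim)
        ids = []
        for t in self.timesteps:
            # Condition the feature on the noise level, then quantize.
            h = self.encoder(image_feat) + self.t_emb(torch.tensor([t]))
            dists = torch.cdist(h, self.codebook.weight)  # (B, codebook)
            ids.append(dists.argmin(dim=-1))
        return torch.stack(ids, dim=1)  # (B, num_timesteps) token ids

print(TimestepTokenizer()(torch.randn(2, 128)).shape)  # torch.Size([2, 5])
```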
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder.
Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ-30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
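A shared-encoder design like Harmon's can be pictured as one masked (MAR-style) encoder feeding two heads: text logits for understanding and patch targets for generation. This is a hypothetical sketch from the abstract, not Harmon's code; the layer counts and masking are assumptions.

```python
import torch
import torch.nn as nn

class SharedMARSketch(nn.Module):
    """Hypothetical sketch: a single masked encoder shared by an
    understanding head (text logits) and a generation head (patches)."""
    def __init__(self, dim: int = 256, vocab: int = 32000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.lm_head = nn.Linear(dim, vocab)   # understanding: text logits
        self.pix_head = nn.Linear(dim, dim)    # generation: patch targets

    def forward(self, patches: torch.Tensor, mask: torch.Tensor):
        # patches: (B, N, dim); mask: (B, N) bool, True = masked-out patch.
        x = torch.where(mask.unsqueeze(-1), self.mask_token, patches)
        h = self.shared_encoder(x)
        return self.lm_head(h), self.pix_head(h)

p = torch.randn(2, 16, 256)
m = torch.rand(2, 16) < 0.5
logits, recon = SharedMARSketch()(p, m)
print(logits.shape, recon.shape)  # (2, 16, 32000) (2, 16, 256)
```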
- ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy [14.703591553247948]
ARMOR is a framework that achieves both understanding and generation by fine-tuning existing multimodal large language models.
ARMOR extends existing MLLMs from three perspectives: model architecture, training data, and training algorithm.
Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities.
arXiv Detail & Related papers (2025-03-09T10:15:39Z)
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation.
We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding.
Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z)
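Token folding, as named in the SynerGen-VL summary above, plausibly amounts to merging each small window of image tokens into one so that high-resolution token sequences fit the LLM context. A hypothetical sketch, with the fold factor and projection width chosen arbitrarily:

```python
import torch
import torch.nn as nn

def fold_tokens(tokens: torch.Tensor, grid: int, fold: int = 2) -> torch.Tensor:
    """Hypothetical token-folding sketch: merge each fold x fold window
    of image tokens into one, cutting sequence length by fold**2."""
    b, n, d = tokens.shape
    h = w = grid
    x = tokens.reshape(b, h, w, d)
    x = x.reshape(b, h // fold, fold, w // fold, fold, d)
    # Gather each fold x fold window into a single wide token.
    x = x.permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, (h // fold) * (w // fold), fold * fold * d)

proj = nn.Linear(4 * 256, 256)  # project folded tokens back to model width
folded = proj(fold_tokens(torch.randn(1, 1024, 256), grid=32))
print(folded.shape)  # torch.Size([1, 256, 256]): 4x fewer tokens
```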
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
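MaVEn's multi-granularity hybrid encoding can be read as pairing coarse discrete concept tokens with continuous fine-grained patch features. The sketch below is a guess at that pairing based only on the abstract; the pooling, codebook size, and interleaving are illustrative.

```python
import torch
import torch.nn as nn

class HybridEncoderSketch(nn.Module):
    """Hypothetical multi-granularity sketch: a coarse discrete concept
    token per image, followed by continuous fine-grained patch features."""
    def __init__(self, dim: int = 256, concepts: int = 8192):
        super().__init__()
        self.concept_book = nn.Embedding(concepts, dim)
        self.patch_proj = nn.Linear(dim, dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:  # (B, N, dim)
        pooled = patch_feats.mean(dim=1)                  # coarse image summary
        dists = torch.cdist(pooled, self.concept_book.weight)
        concept = self.concept_book(dists.argmin(dim=-1)) # discrete concept token
        fine = self.patch_proj(patch_feats)               # continuous detail
        return torch.cat([concept.unsqueeze(1), fine], dim=1)

out = HybridEncoderSketch()(torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 17, 256]): 1 concept + 16 patch tokens
```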
- LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
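LLMGA's core loop, an MLLM that expands a terse request into a detailed prompt for a generator, is simple enough to sketch. The lambda below stands in for the trained MLLM, and the Stable Diffusion pipeline is an arbitrary stand-in generator, not the model LLMGA actually couples to.

```python
from diffusers import StableDiffusionPipeline  # stand-in generator only

def expand_prompt(mllm, user_request: str) -> str:
    """Hypothetical stand-in for the trained MLLM: turn a terse user
    request into a detailed image-generation prompt."""
    return mllm(f"Write a detailed image-generation prompt for: {user_request}")

# Toy MLLM stub; in LLMGA this would be the fine-tuned multimodal model.
mllm = lambda q: "a sunlit reading nook, oak shelves, warm film grain, 35mm"

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(expand_prompt(mllm, "cozy reading corner")).images[0]
image.save("out.png")
```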