Unified Multimodal Model as Auto-Encoder
- URL: http://arxiv.org/abs/2509.09666v3
- Date: Fri, 10 Oct 2025 09:54:16 GMT
- Title: Unified Multimodal Model as Auto-Encoder
- Authors: Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
- Abstract summary: We introduce a paradigm that regards understanding as the encoder (I2T), compressing images into text, and generation as the decoder (T2I), reconstructing images from that text. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception.
- Score: 69.38946823657592
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The pursuit of unified multimodal models (UMMs) has long been hindered by a fundamental schism between multimodal understanding and generation. Current approaches typically disentangle the two and treat them as separate endeavors with disjoint objectives, missing the mutual benefits. We argue that true unification requires more than just merging two tasks. It requires a unified, foundational objective that intrinsically links them. In this paper, we introduce an insightful paradigm through the Auto-Encoder lens, i.e., regarding understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. To implement this, we propose UAE, where we begin by pre-training the decoder with the proposed 700k long-context image-caption pairs to direct it to "understand" the fine-grained and complex semantics from the text. We then propose Unified-GRPO via reinforcement learning (RL) to unify the two, which covers two complementary stages: (1) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual perception; (2) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception like small object and color recognition (verified on MMT-Bench). This bidirectional improvement reveals a deep synergy: under the unified reconstruction objective, generation and understanding can mutually benefit each other, moving closer to truly unified multimodal intelligence.
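Concretely, the auto-encoder framing reduces to a loop of caption, reconstruct, score. Below is a minimal PyTorch sketch of that reward signal; the module names (`encoder_i2t`, `decoder_t2i`) are hypothetical stand-ins, and pixel MSE is only a placeholder for whatever reconstruction metric Unified-GRPO actually optimizes.

```python
import torch
import torch.nn.functional as F

def reconstruction_reward(image, encoder_i2t, decoder_t2i):
    """One rollout of the auto-encoder objective: understanding (I2T)
    compresses the image into a caption, generation (T2I) reconstructs
    from it, and the reward is the negative reconstruction error.
    Stage 1 of Unified-GRPO would push this reward into the encoder's
    caption policy; stage 2 would use it to refine the decoder."""
    caption = encoder_i2t(image)   # understanding = encoding (I2T)
    recon = decoder_t2i(caption)   # generation = decoding (T2I)
    return -F.mse_loss(recon, image)

# Toy stand-ins so the sketch runs end to end.
image = torch.rand(1, 3, 64, 64)
enc = lambda x: x.mean(dim=(2, 3))                         # "caption" as a 3-d code
dec = lambda c: c[:, :, None, None].expand(-1, 3, 64, 64)  # trivial "reconstruction"
print(reconstruction_reward(image, enc, dec))
```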
Related papers
- Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders [46.79030733172859]
We propose a think-then-generate (T2G) paradigm for text-to-image (T2I) diffusion models. We show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
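As a schematic, the paradigm amounts to an LLM reasoning pass before the diffusion call; the function names and prompt template below are assumptions, not the paper's interface.

```python
def think_then_generate(request, llm, t2i):
    """Two-stage flow: the LLM first reasons about what the image must
    contain and emits a detailed, factually grounded prompt; the T2I
    diffusion model then renders that rewritten prompt."""
    rewritten = llm(
        "Reason step by step about what the requested image must contain, "
        "then output one detailed prompt for a text-to-image model.\n"
        f"Request: {request}"
    )
    return t2i(rewritten)
```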
arXiv Detail & Related papers (2026-01-15T12:19:05Z) - Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models [23.529904770014735]
This paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. We propose Forge-and-Quench, a new unified framework that puts this principle into practice. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models.
arXiv Detail & Related papers (2026-01-08T08:18:44Z) - EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture [87.55157183411507]
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. EMMA primarily consists of: (1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation; and (2) channel-wise concatenation, instead of token-wise concatenation, of visual understanding and generation tokens, which further reduces the visual tokens in unified architectures.
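The token-wise vs. channel-wise trade-off is visible directly in tensor shapes; a small PyTorch illustration with arbitrary dimensions (not EMMA's actual configuration):

```python
import torch

B, N, C = 2, 256, 1024                 # batch, tokens, channels (illustrative)
und = torch.randn(B, N, C)             # visual-understanding tokens
gen = torch.randn(B, N, C)             # visual-generation tokens

token_wise = torch.cat([und, gen], dim=1)    # (B, 2N, C): sequence doubles,
                                             # so attention cost grows ~4x
channel_wise = torch.cat([und, gen], dim=2)  # (B, N, 2C): sequence length
proj = torch.nn.Linear(2 * C, C)             # is unchanged; a projection
fused = proj(channel_wise)                   # fuses the streams per token
print(token_wise.shape, fused.shape)         # (2, 512, 1024) vs (2, 256, 1024)
```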
arXiv Detail & Related papers (2025-12-04T14:01:53Z) - VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. This design incurs negligible loss of semantic information, preserving multimodal understanding ability while providing discrete tokens for generation. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z) - Interleaving Reasoning for Better Text-to-Image Generation [83.69082794730664]
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals. Experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
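A skeletal version of that alternation, with `think_fn` and `draw_fn` as hypothetical stand-ins for the reasoning model and the image generator:

```python
def interleaved_generation(prompt, think_fn, draw_fn, rounds=3):
    """Alternate text-based thinking and image synthesis: each round
    reflects on the current draft and re-synthesizes from the refined
    thought, rather than generating once from the raw prompt."""
    thought, image = prompt, None
    for _ in range(rounds):
        thought = think_fn(prompt, thought, image)  # reason about the draft
        image = draw_fn(thought)                    # synthesize / refine
    return image
```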
arXiv Detail & Related papers (2025-09-08T17:56:23Z) - SEDEG: Sequential Enhancement of Decoder and Encoder's Generality for Class Incremental Learning with Small Memory [11.197556113382186]
In incremental learning, enhancing the generality of knowledge is crucial for adapting to dynamic data inputs. SEDEG trains an ensembled encoder through feature boosting to learn generalized representations. The next stage uses knowledge distillation strategies to compress the ensembled encoder into a new, more generalized encoder.
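That compression step is a standard distillation pattern; a generic sketch follows (SEDEG's actual losses and feature-boosting details may differ):

```python
import torch
import torch.nn.functional as F

def distill_step(x, ensemble, student, temperature=2.0):
    """Compress an ensembled encoder into one generalized encoder via
    logit distillation: the student matches the softened ensemble
    outputs. A generic KD loss, standing in for SEDEG's strategy."""
    with torch.no_grad():
        teacher_logits = ensemble(x) / temperature
    student_logits = student(x) / temperature
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean") * temperature ** 2
```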
arXiv Detail & Related papers (2025-08-18T13:55:59Z) - Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ-30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z) - DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies [25.77487827338777]
A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details. A vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. We propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer.
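One reading of "dual visual vocabularies" is two codebooks over one shared feature stream; a toy sketch, with illustrative sizes rather than DualToken's actual design:

```python
import torch
import torch.nn as nn

class DualVocabTokenizer(nn.Module):
    """One tokenizer, two codebooks: a low-level vocabulary for
    reconstruction/generation detail and a semantic vocabulary aligned
    with language for understanding. Nearest-code quantization shown."""
    def __init__(self, dim=256, n_low=1024, n_sem=1024):
        super().__init__()
        self.low_book = nn.Embedding(n_low, dim)  # perceptual-detail vocab
        self.sem_book = nn.Embedding(n_sem, dim)  # language-aligned vocab

    def forward(self, feats):  # feats: (num_patches, dim) from a shared encoder
        low_ids = torch.cdist(feats, self.low_book.weight).argmin(-1)
        sem_ids = torch.cdist(feats, self.sem_book.weight).argmin(-1)
        return low_ids, sem_ids

tok = DualVocabTokenizer()
print(tok(torch.randn(16, 256)))
```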
arXiv Detail & Related papers (2025-03-18T14:56:46Z) - QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation [101.28446308930367]
Quantized Language-Image Pretraining (QLIP) combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
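The two objectives can be coupled in a single loss over one autoencoder; a hedged sketch (the BSQ quantizer is elided, and `project` into the text-embedding space is an assumption of this sketch):

```python
import torch.nn.functional as F

def qlip_style_loss(image, text_emb, encode, decode, project, alpha=1.0):
    """Couple the two objectives named above on one autoencoder:
    pixel reconstruction plus language-image alignment."""
    z = encode(image)                           # quantized latent, (B, D)
    recon = F.mse_loss(decode(z), image)        # reconstruction objective
    img = F.normalize(project(z), dim=-1)       # alignment objective:
    txt = F.normalize(text_emb, dim=-1)         # cosine distance between
    align = 1.0 - (img * txt).sum(dim=-1).mean()  # image and text embeddings
    return recon + alpha * align
```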
arXiv Detail & Related papers (2025-02-07T18:59:57Z) - Epsilon-VAE: Denoising as Visual Decoding [61.29255979767292]
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image. By adopting iterative reconstruction through diffusion, our autoencoder, Epsilon-VAE, achieves high reconstruction quality.
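A skeletal version of denoising-as-decoding, where `denoiser(x, latent, t)` is a hypothetical one-step refiner rather than Epsilon-VAE's actual scheduled diffusion:

```python
import torch

def denoising_decode(latent, denoiser, shape=(1, 3, 64, 64), steps=8):
    """Denoising as decoding: instead of one feed-forward decoder pass,
    start from pure noise and iteratively refine it, conditioned on the
    encoder latent."""
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        x = denoiser(x, latent, t)  # one refinement step toward the image
    return x
```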
arXiv Detail & Related papers (2024-10-05T08:27:53Z) - Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval [67.52910255064762]
We design a simple dual-stream structure, including a temporal layer and a hash layer.
With the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval.
In this way, the model naturally preserves the disentangled semantics in its binary codes.
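A compact sketch of that dual-stream structure; the GRU temporal layer and tanh relaxation are generic stand-ins, not necessarily the paper's choices:

```python
import torch
import torch.nn as nn

class DualStreamHasher(nn.Module):
    """Temporal layer aggregates frame features; hash layer maps the
    result to a code; sign() binarizes for retrieval at test time.
    Dimensions are illustrative."""
    def __init__(self, feat_dim=512, code_bits=64):
        super().__init__()
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.hash = nn.Linear(feat_dim, code_bits)

    def forward(self, frames):           # frames: (B, T, feat_dim)
        _, h = self.temporal(frames)     # temporal layer
        logits = self.hash(h[-1])        # hash layer
        return torch.tanh(logits)        # relaxation of {-1, 1} for training

codes = torch.sign(DualStreamHasher()(torch.randn(2, 16, 512)))
```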
arXiv Detail & Related papers (2023-10-12T03:21:12Z)