EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
- URL: http://arxiv.org/abs/2512.04810v4
- Date: Tue, 09 Dec 2025 13:32:41 GMT
- Title: EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
- Authors: Xin He, Longhui Wei, Jianbo Ouyang, Minghui Liao, Lingxi Xie, Qi Tian
- Abstract summary: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures.
- Score: 87.55157183411507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation of visual understanding and generation tokens, which further reduces the number of visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments show that EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
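As a rough illustration of point 2 above, the snippet below contrasts token-wise and channel-wise concatenation of understanding and generation tokens; the tensor shapes and the fusion projection are assumptions made for the sketch, not details taken from the paper.

```python
# Toy sketch of point 2 (channel-wise vs. token-wise concatenation).
# Shapes and the fusion projection are illustrative assumptions, not EMMA's code.
import torch

batch, num_tokens, dim = 2, 256, 1024
und_tokens = torch.randn(batch, num_tokens, dim)  # visual understanding tokens
gen_tokens = torch.randn(batch, num_tokens, dim)  # visual generation tokens

# Token-wise concatenation: sequence length doubles, so attention cost grows ~4x.
token_wise = torch.cat([und_tokens, gen_tokens], dim=1)     # (2, 512, 1024)

# Channel-wise concatenation: sequence length stays fixed; a projection
# (assumed here) maps the widened channels back to the model dimension.
channel_wise = torch.cat([und_tokens, gen_tokens], dim=-1)  # (2, 256, 2048)
fuse = torch.nn.Linear(2 * dim, dim)
fused = fuse(channel_wise)                                  # (2, 256, 1024)

print(token_wise.shape, fused.shape)
```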
Related papers
- Understanding and Harnessing Sparsity in Unified Multimodal Models [32.09095929575726]
Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. Yet, a systematic understanding of how inefficiencies manifest across different components remains limited.
arXiv Detail & Related papers (2025-12-02T02:47:29Z)
- MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation [20.14002849273559]
Unified multimodal models aim to integrate understanding and generation within a single framework. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework. Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks.
arXiv Detail & Related papers (2025-11-23T03:25:39Z)
- Unified Multimodal Model as Auto-Encoder [69.38946823657592]
We introduce a paradigm that treats understanding as the encoder (I2T), compressing images into text, and generation as the decoder (T2I), reconstructing images from that text. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception.
arXiv Detail & Related papers (2025-09-11T17:57:59Z)
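The auto-encoder reading in the entry above can be sketched in a few lines; `i2t` and `t2i` below are placeholder callables standing in for the understanding and generation halves of a unified model, not the paper's actual interface.

```python
# Conceptual sketch only: text acts as the latent code of an image auto-encoder.
# `i2t` (understanding) and `t2i` (generation) are hypothetical callables.
from typing import Callable
import torch
import torch.nn.functional as F

def reconstruction_loss(
    image: torch.Tensor,
    i2t: Callable[[torch.Tensor], str],    # encoder: image -> text description
    t2i: Callable[[str], torch.Tensor],    # decoder: text -> regenerated image
) -> torch.Tensor:
    caption = i2t(image)            # compress the image into text
    recon = t2i(caption)            # reconstruct the image from that text
    # How much visual information survived the round trip through text.
    return F.mse_loss(recon, image)
```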
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models [92.37117312251755]
We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR). For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
arXiv Detail & Related papers (2025-07-28T13:50:53Z)
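The general flavor of per-encoder token pruning followed by fusion can be sketched as below; the norm-based score and fixed keep ratio are simple stand-ins, not METEOR's actual rank-guided assignment or cooperative pruning rules.

```python
# Illustrative sketch of per-encoder token pruning + fusion (not METEOR's code).
import torch

def prune_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the highest-scoring tokens from one vision encoder.
    tokens: (batch, num_tokens, dim)."""
    scores = tokens.norm(dim=-1)                       # stand-in importance score
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                # indices of kept tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return tokens.gather(1, idx)                       # (batch, k, dim)

# Two encoders with different token budgets, then a simple concatenation fusion.
enc_a = torch.randn(2, 576, 1024)   # e.g. a CLIP-style encoder (assumed shape)
enc_b = torch.randn(2, 729, 1024)   # e.g. a higher-resolution encoder (assumed shape)
fused = torch.cat([prune_tokens(enc_a, 0.25), prune_tokens(enc_b, 0.25)], dim=1)
print(fused.shape)
```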
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z)
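As a very rough approximation of the token-merging idea described above, the sketch below greedily averages the most similar token pairs, with the merge budget tied to a toy complexity proxy; DyMU's actual DToMe/VTU procedure is considerably more sophisticated.

```python
# Rough sketch in the spirit of dynamic token merging; not DyMU's algorithm.
import torch

def merge_similar_tokens(tokens: torch.Tensor, max_merges: int) -> torch.Tensor:
    """tokens: (num_tokens, dim). Greedily average the most similar pairs."""
    for _ in range(max_merges):
        x = torch.nn.functional.normalize(tokens, dim=-1)
        sim = x @ x.T
        sim.fill_diagonal_(-1.0)                      # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged = (tokens[i] + tokens[j]) / 2          # average the closest pair
        keep = [k for k in range(tokens.shape[0]) if k not in (i, j)]
        tokens = torch.cat([tokens[keep], merged.unsqueeze(0)], dim=0)
    return tokens

tokens = torch.randn(196, 768)                        # one image's patch tokens
complexity = tokens.std().item()                      # toy complexity proxy
budget = 128 if complexity > 1.0 else 64              # more complex image keeps more tokens
out = merge_similar_tokens(tokens, max_merges=tokens.shape[0] - budget)
print(out.shape)
```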
- An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- PEM: Prototype-based Efficient MaskFormer for Image Segmentation [10.795762739721294]
Recent transformer-based architectures have shown impressive results in the field of image segmentation.
We propose Prototype-based Efficient MaskFormer (PEM), an efficient transformer-based architecture that can operate across multiple segmentation tasks.
arXiv Detail & Related papers (2024-02-29T18:21:54Z)