MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
- URL: http://arxiv.org/abs/2511.18262v1
- Date: Sun, 23 Nov 2025 03:25:39 GMT
- Title: MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
- Authors: Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, Qi She, Chang Liu, Zhenbang Sun
- Abstract summary: Unified multimodal models aim to integrate understanding and generation within a single framework. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework. Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks.
- Score: 20.14002849273559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.
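As a rough illustration of the coupling described in the abstract, the sketch below trains a toy AR path with a next-token-prediction loss, linearly aligns its hidden states into conditioning features, and trains a stand-in decoder with a flow-matching loss on continuous latents. This is not the authors' implementation: the module sizes, the generic Transformer decoder used in place of the single-stream DiT, and the omission of generation experts, multi-layer feature aggregation, unified condition encoding, and in-context conditioning are all simplifying assumptions.

```python
# Illustrative sketch (not the MammothModa2 code) of a serial AR-Diffusion model
# trained with joint next-token-prediction and flow-matching objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARDiffusionSketch(nn.Module):
    def __init__(self, vocab=8192, d_ar=512, d_dit=512, latent_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_ar)
        # Causal AR path over discrete tokens (generation experts omitted).
        self.ar = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_ar, 8, batch_first=True), num_layers=4)
        self.lm_head = nn.Linear(d_ar, vocab)
        # Feature alignment: project AR hidden states into the decoder's width.
        self.align = nn.Linear(d_ar, d_dit)
        # Stand-in for the DiT decoder: cross-attends to the aligned AR features.
        self.dit = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_dit, 8, batch_first=True), num_layers=4)
        self.in_proj = nn.Linear(latent_dim, d_dit)
        self.out_proj = nn.Linear(d_dit, latent_dim)

    def forward(self, tokens, latents):
        # ---- AR path: next-token prediction over discrete tokens ----
        h = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.ar(h, mask=mask)
        logits = self.lm_head(h[:, :-1])
        ntp_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   tokens[:, 1:].reshape(-1))
        # ---- Diffusion path: flow matching on continuous image latents ----
        t = torch.rand(latents.size(0), 1, 1)      # per-sample time in [0, 1)
        noise = torch.randn_like(latents)
        x_t = (1 - t) * noise + t * latents        # linear interpolation path
        target_v = latents - noise                 # flow-matching velocity target
        cond = self.align(h)                       # AR representations as conditioning
        pred_v = self.out_proj(self.dit(self.in_proj(x_t), cond))
        fm_loss = F.mse_loss(pred_v, target_v)
        return ntp_loss + fm_loss                  # joint NTP + flow-matching loss

# Toy end-to-end usage with invented shapes.
model = ARDiffusionSketch()
tokens = torch.randint(0, 8192, (2, 32))           # discrete prompt/image tokens
latents = torch.randn(2, 64, 16)                    # continuous latents (e.g. from a VAE)
loss = model(tokens, latents)
loss.backward()
```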
Related papers
- LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model [77.66516875262963]
We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. Building on MoD, we introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings. Experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks.
arXiv Detail & Related papers (2026-03-01T12:05:06Z)
- Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion [60.186310080523135]
The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. We propose CoM-DAD, a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
arXiv Detail & Related papers (2026-01-07T16:21:19Z)
- EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture [87.55157183411507]
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. EMMA primarily consists of (1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation, and (2) channel-wise rather than token-wise concatenation of visual understanding and generation tokens, which further reduces the visual token count in unified architectures.
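The channel-wise versus token-wise distinction in (2) can be illustrated with a toy shape check (invented sizes, not EMMA's code): token-wise concatenation doubles the sequence length the transformer must attend over, while channel-wise concatenation keeps the token count fixed and widens each token instead.

```python
# Toy contrast between token-wise and channel-wise concatenation of
# understanding and generation tokens (shapes are illustrative only).
import torch

B, N, C = 2, 256, 768                        # batch, visual tokens, channels
und = torch.randn(B, N, C)                   # understanding-branch tokens
gen = torch.randn(B, N, C)                   # generation-branch tokens

token_wise = torch.cat([und, gen], dim=1)    # (B, 2N, C): sequence length doubles
channel_wise = torch.cat([und, gen], dim=2)  # (B, N, 2C): token count unchanged
print(token_wise.shape, channel_wise.shape)  # [2, 512, 768] vs. [2, 256, 1536]
```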
arXiv Detail & Related papers (2025-12-04T14:01:53Z)
- MM-DETR: An Efficient Multimodal Detection Transformer with Mamba-Driven Dual-Granularity Fusion and Frequency-Aware Modality Adapters [12.063966356953186]
Multimodal remote sensing object detection aims to achieve more accurate and robust perception under challenging conditions. Existing approaches that rely on attention-based or deformable convolution fusion blocks still struggle to balance performance and lightweight design. We propose MM-DETR, a lightweight and efficient framework for multimodal object detection.
arXiv Detail & Related papers (2025-11-29T07:23:01Z)
- TiDAR: Think in Diffusion, Talk in Autoregression [59.94106070312094]
TiDAR is a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively. TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
arXiv Detail & Related papers (2025-11-12T02:59:33Z)
- MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering [7.928163920344391]
We propose a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC is designed as a plug-and-play module, and our experiments validate its effectiveness. It accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58.
arXiv Detail & Related papers (2025-10-05T14:23:51Z)
- Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation [63.50827603618498]
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing.
arXiv Detail & Related papers (2025-09-23T17:05:46Z)
- FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework. Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development. We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z)
- Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space [9.327655601475605]
We propose Nexus-Gen, a novel architecture that unifies image understanding, generation, and editing tasks in a shared image embedding space. To mitigate the severe error accumulation during autoregressive embedding prediction, we propose a novel prefilled autoregression strategy. Nexus-Gen achieves state-of-the-art performance on the evaluation benchmarks spanning image understanding, generation and editing tasks.
arXiv Detail & Related papers (2025-04-30T06:30:48Z)
- DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework [2.187990941788468]
DiM-Gesture is a generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio.
The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture.
arXiv Detail & Related papers (2024-08-01T08:22:47Z)
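For background on the AdaLN conditioning mentioned in the DiM-Gesture entry above: adaptive layer normalization modulates normalized activations with a per-channel scale and shift regressed from a conditioning vector (e.g., a diffusion timestep plus audio features). The following is a generic, self-contained sketch with invented sizes, not the DiM-Gesture or Mamba-2 implementation.

```python
# Generic AdaLN (adaptive layer normalization) sketch; sizes are illustrative.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # LayerNorm without its own affine parameters; conditioning supplies them.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        # x: (batch, seq, dim); cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

x = torch.randn(2, 64, 256)
cond = torch.randn(2, 128)
print(AdaLN(256, 128)(x, cond).shape)   # torch.Size([2, 64, 256])
```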