MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
- URL: http://arxiv.org/abs/2511.18262v1
- Date: Sun, 23 Nov 2025 03:25:39 GMT
- Title: MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
- Authors: Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, Qi She, Chang Liu, Zhenbang Sun
- Abstract summary: Unified multimodal models aim to integrate understanding and generation within a single framework. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework. Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks.
- Score: 20.14002849273559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.
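As a rough illustration of the coupling described in the abstract, the sketch below trains a toy AR path with a next-token-prediction loss, linearly aligns its hidden states into conditioning features, and trains a stand-in decoder with a flow-matching loss on continuous latents. This is not the authors' implementation: the module sizes, the generic Transformer decoder used in place of the single-stream DiT, and the omission of generation experts, multi-layer feature aggregation, unified condition encoding, and in-context conditioning are all simplifying assumptions.

```python
# Illustrative sketch (not the MammothModa2 code) of a serial AR-Diffusion model
# trained with joint next-token-prediction and flow-matching objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARDiffusionSketch(nn.Module):
    def __init__(self, vocab=8192, d_ar=512, d_dit=512, latent_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_ar)
        # Causal AR path over discrete tokens (generation experts omitted).
        self.ar = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_ar, 8, batch_first=True), num_layers=4)
        self.lm_head = nn.Linear(d_ar, vocab)
        # Feature alignment: project AR hidden states into the decoder's width.
        self.align = nn.Linear(d_ar, d_dit)
        # Stand-in for the DiT decoder: cross-attends to the aligned AR features.
        self.dit = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_dit, 8, batch_first=True), num_layers=4)
        self.in_proj = nn.Linear(latent_dim, d_dit)
        self.out_proj = nn.Linear(d_dit, latent_dim)

    def forward(self, tokens, latents):
        # ---- AR path: next-token prediction over discrete tokens ----
        h = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.ar(h, mask=mask)
        logits = self.lm_head(h[:, :-1])
        ntp_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   tokens[:, 1:].reshape(-1))
        # ---- Diffusion path: flow matching on continuous image latents ----
        t = torch.rand(latents.size(0), 1, 1)      # per-sample time in [0, 1)
        noise = torch.randn_like(latents)
        x_t = (1 - t) * noise + t * latents        # linear interpolation path
        target_v = latents - noise                 # flow-matching velocity target
        cond = self.align(h)                       # AR representations as conditioning
        pred_v = self.out_proj(self.dit(self.in_proj(x_t), cond))
        fm_loss = F.mse_loss(pred_v, target_v)
        return ntp_loss + fm_loss                  # joint NTP + flow-matching loss

# Toy end-to-end usage with invented shapes.
model = ARDiffusionSketch()
tokens = torch.randint(0, 8192, (2, 32))           # discrete prompt/image tokens
latents = torch.randn(2, 64, 16)                    # continuous latents (e.g. from a VAE)
loss = model(tokens, latents)
loss.backward()
```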
Related papers
- LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model [77.66516875262963]
We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. Building on MoD, we introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings. Experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks.
arXiv Detail & Related papers (2026-03-01T12:05:06Z)
- Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion [60.186310080523135]
The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. We propose CoM-DAD, a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
arXiv Detail & Related papers (2026-01-07T16:21:19Z)
- EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture [87.55157183411507]
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. EMMA primarily consists of (1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation, and (2) channel-wise rather than token-wise concatenation of visual understanding and generation tokens, which further reduces the visual token count in unified architectures.
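The channel-wise versus token-wise distinction in (2) can be illustrated with a toy shape check (invented sizes, not EMMA's code): token-wise concatenation doubles the sequence length the transformer must attend over, while channel-wise concatenation keeps the token count fixed and widens each token instead.

```python
# Toy contrast between token-wise and channel-wise concatenation of
# understanding and generation tokens (shapes are illustrative only).
import torch

B, N, C = 2, 256, 768                        # batch, visual tokens, channels
und = torch.randn(B, N, C)                   # understanding-branch tokens
gen = torch.randn(B, N, C)                   # generation-branch tokens

token_wise = torch.cat([und, gen], dim=1)    # (B, 2N, C): sequence length doubles
channel_wise = torch.cat([und, gen], dim=2)  # (B, N, 2C): token count unchanged
print(token_wise.shape, channel_wise.shape)  # [2, 512, 768] vs. [2, 256, 1536]
```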
arXiv Detail & Related papers (2025-12-04T14:01:53Z)
- MM-DETR: An Efficient Multimodal Detection Transformer with Mamba-Driven Dual-Granularity Fusion and Frequency-Aware Modality Adapters [12.063966356953186]
Multimodal remote sensing object detection aims to achieve more accurate and robust perception under challenging conditions. Existing approaches that rely on attention-based or deformable convolution fusion blocks still struggle to balance performance and lightweight design. We propose MM-DETR, a lightweight and efficient framework for multimodal object detection.
arXiv Detail & Related papers (2025-11-29T07:23:01Z)
- TiDAR: Think in Diffusion, Talk in Autoregression [59.94106070312094]
TiDAR is a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively. TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
arXiv Detail & Related papers (2025-11-12T02:59:33Z)
- MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering [7.928163920344391]
We propose a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC is designed as a plug-and-play module, and our experiments validate its effectiveness. It accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58.
arXiv Detail & Related papers (2025-10-05T14:23:51Z)
- Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation [63.50827603618498]
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing.
arXiv Detail & Related papers (2025-09-23T17:05:46Z)
- FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework. Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development. We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z)
- Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space [9.327655601475605]
We propose Nexus-Gen, a novel architecture that unifies image understanding, generation, and editing tasks in a shared image embedding space. To mitigate the severe error accumulation during autoregressive embedding prediction, we propose a novel prefilled autoregression strategy. Nexus-Gen achieves state-of-the-art performance on the evaluation benchmarks spanning image understanding, generation and editing tasks.
arXiv Detail & Related papers (2025-04-30T06:30:48Z)
- DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework [2.187990941788468]
DiM-Gesture is a generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio.
The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture.
arXiv Detail & Related papers (2024-08-01T08:22:47Z)
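For background on the AdaLN conditioning mentioned in the DiM-Gesture entry above: adaptive layer normalization modulates normalized activations with a per-channel scale and shift regressed from a conditioning vector (e.g., a diffusion timestep plus audio features). The following is a generic, self-contained sketch with invented sizes, not the DiM-Gesture or Mamba-2 implementation.

```python
# Generic AdaLN (adaptive layer normalization) sketch; sizes are illustrative.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # LayerNorm without its own affine parameters; conditioning supplies them.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        # x: (batch, seq, dim); cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

x = torch.randn(2, 64, 256)
cond = torch.randn(2, 128)
print(AdaLN(256, 128)(x, cond).shape)   # torch.Size([2, 64, 256])
```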