LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
- URL: http://arxiv.org/abs/2603.01068v1
- Date: Sun, 01 Mar 2026 12:05:06 GMT
- Title: LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
- Authors: Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen
- Abstract summary: We present **LLaDA-o**, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. Building on MoD, we introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings. Experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present **LLaDA-o**, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
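As a rough illustration of the MoD framework described in the abstract, the sketch below pairs a discrete masked-diffusion objective for text with a continuous denoising objective for image latents, both routed through one shared attention backbone. All names (MoDSketch, SharedBackbone, MASK_ID) and sizes are illustrative assumptions, not the released LLaDA-o code; the length adaptation strategy is data-centric, so it lives in the training data rather than in this architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id reserved for the [MASK] token

class SharedBackbone(nn.Module):
    """Single attention stack serving both modalities (assumed design)."""
    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, x):
        return self.encoder(x)

class MoDSketch(nn.Module):
    def __init__(self, vocab=32000, dim=512, img_dim=16):
        super().__init__()
        self.backbone = SharedBackbone(dim)
        self.tok_emb = nn.Embedding(vocab, dim)   # discrete text branch
        self.txt_head = nn.Linear(dim, vocab)     # predicts masked tokens
        self.img_in = nn.Linear(img_dim, dim)     # continuous image branch
        self.img_head = nn.Linear(dim, img_dim)   # predicts the added noise

    def text_loss(self, tokens):
        # Discrete masked diffusion: corrupt a random fraction of tokens
        # with [MASK] and train the model to recover the originals.
        # The 0.3 floor keeps at least some tokens masked in this toy run.
        t = 0.3 + 0.7 * torch.rand(tokens.size(0), 1)
        masked = torch.rand(tokens.shape) < t
        logits = self.txt_head(self.backbone(
            self.tok_emb(tokens.masked_fill(masked, MASK_ID))))
        return F.cross_entropy(logits[masked], tokens[masked])

    def image_loss(self, latents):
        # Continuous diffusion: blend latents with Gaussian noise under a
        # simple linear schedule and train the model to predict the noise.
        t = torch.rand(latents.size(0), 1, 1)
        noise = torch.randn_like(latents)
        pred = self.img_head(self.backbone(
            self.img_in((1 - t) * latents + t * noise)))
        return F.mse_loss(pred, noise)

model = MoDSketch()
text = torch.randint(1, 32000, (2, 16))   # toy token ids
imgs = torch.randn(2, 64, 16)             # toy image latents
(model.text_loss(text) + model.image_loss(imgs)).backward()
```

Sharing the backbone means a fixed condition (e.g., a text prompt) is encoded once for both objectives, which is the redundant computation the abstract says the MoD design avoids.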
Related papers
- MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
Unified multimodal models aim to integrate understanding and generation within a single framework. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework. Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks.
arXiv Detail & Related papers (2025-11-23T03:25:39Z)
- Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone
We introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs.
arXiv Detail & Related papers (2025-11-19T23:23:49Z)
- Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
Lumina-DiMOO is an open-source foundational model for seamless multi-modal generation and understanding. It uses fully discrete diffusion modeling to handle inputs and outputs across various modalities. It achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models.
arXiv Detail & Related papers (2025-10-07T17:59:20Z)
- Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces
We propose a novel framework for building multimodal diffusion models on arbitrary state spaces. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously.
arXiv Detail & Related papers (2025-06-09T16:20:20Z)
- LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
We introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and a connector that projects visual features into the language embedding space.
arXiv Detail & Related papers (2025-05-22T17:23:26Z)
- DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
DiffMoE uses a batch-level global token pool that enables experts to access global token distributions during training. It achieves state-of-the-art performance among diffusion models on the ImageNet benchmark. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation.
arXiv Detail & Related papers (2025-03-18T17:57:07Z)
- Generalized Interpolating Discrete Diffusion
Masked diffusion is a popular choice due to its simplicity and effectiveness. We introduce generalized interpolating discrete diffusion (GIDD), a new family of noising processes that offers greater flexibility in their design. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality (a generic masked-diffusion decoding loop is sketched after this list).
arXiv Detail & Related papers (2025-03-06T14:30:55Z)
- Multimodal Latent Language Modeling with Next-Token Diffusion
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
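Several entries above (LLaDA-o, LLaDA-V, DiffuApriel, GIDD, Lumina-DiMOO) rest on discrete masked diffusion. For reference, here is a minimal, generic sketch of the unmask-as-you-go decoding loop such models share; `denoiser` is a placeholder for any trained network, and the confidence-ranked linear schedule is one common choice, not any specific paper's procedure.

```python
import torch

@torch.no_grad()
def masked_diffusion_sample(denoiser, length, steps=8, mask_id=0):
    """Generic masked-diffusion decoding: start fully masked, then at each
    step commit the positions the model is most confident about."""
    x = torch.full((1, length), mask_id, dtype=torch.long)
    for step in range(steps):
        still_masked = x == mask_id
        if not still_masked.any():
            break
        probs = denoiser(x).softmax(-1)               # (1, length, vocab)
        conf, pred = probs.max(-1)                    # confidence + argmax token
        conf = conf.masked_fill(~still_masked, -1.0)  # only fill masked slots
        # Linear schedule: unmask an equal share of the remainder each step.
        k = max(1, int(still_masked.sum()) // (steps - step))
        idx = conf.topk(k, dim=-1).indices[0]
        x[0, idx] = pred[0, idx]
    return x

# Toy run with a random "model" so the sketch executes end to end.
# (A real model would reserve mask_id and never predict it.)
toy = lambda x: torch.randn(x.size(0), x.size(1), 100)
print(masked_diffusion_sample(toy, length=16))
```

The fixed `length` argument is exactly what LLaDA-o's length adaptation targets: per the abstract, its data-centric strategy enables flexible-length decoding without architectural changes to a loop like this.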