Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models
- URL: http://arxiv.org/abs/2512.14008v1
- Date: Tue, 16 Dec 2025 02:06:06 GMT
- Title: Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models
- Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen,
- Abstract summary: We propose Sparse-LaViDa, a modeling framework that truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks.
- Score: 63.50827603618498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.
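The truncation mechanism described in the abstract can be illustrated with a toy sketch: at a given denoising step, keep only a budget of masked positions, drop the rest, and append register tokens as a compact stand-in for what was dropped. This is a rough illustration under assumed conventions (the `MASK`/`REG` token ids, the per-step budget, and the function itself are made up for exposition), not the paper's implementation:

```python
import numpy as np

MASK, REG = -1, -2  # hypothetical ids for the [MASK] and register tokens

def truncate_masked(tokens, keep_masked, num_registers):
    """One sparse denoising step's input prep (toy sketch): retain all
    unmasked tokens plus at most `keep_masked` masked positions, truncate
    the remaining masked positions, and append `num_registers` register
    tokens as a compact representation of the truncated ones."""
    tokens = np.asarray(tokens)
    masked_idx = np.flatnonzero(tokens == MASK)
    kept_masked = masked_idx[:keep_masked]   # masked positions still denoised this step
    dropped = masked_idx[keep_masked:]       # masked positions truncated this step
    keep = np.sort(np.concatenate([np.flatnonzero(tokens != MASK), kept_masked]))
    sparse = np.concatenate([tokens[keep], np.full(num_registers, REG)])
    return sparse, keep, dropped
```

For a sequence `[5, MASK, 7, MASK, MASK, 9]` with a budget of one masked token and two registers, the model would attend over `[5, MASK, 7, 9, REG, REG]` instead of the full six-token sequence; the specialized attention mask mentioned in the abstract would then be what makes training match this truncated layout.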
Related papers
- Growing Visual Generative Capacity for Pre-Trained MLLMs [60.826355079902505]
Bridge is a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability. We propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens.
arXiv Detail & Related papers (2025-10-02T00:40:02Z)
- Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation [63.50827603618498]
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing.
arXiv Detail & Related papers (2025-09-23T17:05:46Z)
- LlamaSeg: Image Segmentation via Autoregressive Mask Generation [46.17509085054758]
We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs.
arXiv Detail & Related papers (2025-05-26T02:22:41Z)
- Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking [28.55159825491572]
Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. We propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states between the masked and unmasked states. Our method demonstrates superior performance across a diverse set of generative modeling tasks.
arXiv Detail & Related papers (2025-05-24T04:16:40Z)
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens [66.02261367232256]
Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order. In this paper, we build a proper visual language by reconstructing diffusion timesteps to learn discrete visual tokens.
arXiv Detail & Related papers (2025-04-20T16:14:28Z)
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z)
- TimeMAE: Self-Supervised Representations of Time Series with Decoupled Masked Autoencoders [55.00904795497786]
We propose TimeMAE, a novel self-supervised paradigm for learning transferrable time series representations based on transformer networks.
The TimeMAE learns enriched contextual representations of time series with a bidirectional encoding scheme.
To solve the discrepancy issue incurred by newly injected masked embeddings, we design a decoupled autoencoder architecture.
arXiv Detail & Related papers (2023-03-01T08:33:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information it presents and is not responsible for any consequences of its use.