Any-Order Flexible Length Masked Diffusion
- URL: http://arxiv.org/abs/2509.01025v2
- Date: Sun, 07 Sep 2025 22:48:13 GMT
- Title: Any-Order Flexible Length Masked Diffusion
- Authors: Jaeyeon Kim, Lee Cheuk-Kit, Carles Domingo-Enrich, Yilun Du, Sham Kakade, Timothy Ngotiaoco, Sitan Chen, Michael Albergo
- Abstract summary: Masked diffusion models (MDMs) have recently emerged as a promising alternative to autoregressive models over discrete domains. We introduce Flexible Masked Diffusion Models (FlexMDMs), a discrete diffusion paradigm that can model sequences of flexible length while retaining MDMs' any-order inference. We show that FlexMDMs match MDMs in perplexity while modeling length statistics with much higher fidelity.
- Score: 53.89217188409148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked diffusion models (MDMs) have recently emerged as a promising alternative to autoregressive models over discrete domains. MDMs generate sequences in an any-order, parallel fashion, enabling fast inference and strong performance on non-causal tasks. However, a crucial limitation is that they do not support token insertions and are thus limited to fixed-length generations. To this end, we introduce Flexible Masked Diffusion Models (FlexMDMs), a discrete diffusion paradigm that can simultaneously model sequences of flexible length while provably retaining MDMs' flexibility of any-order inference. Grounded in an extension of the stochastic interpolant framework, FlexMDMs generate sequences by inserting mask tokens and unmasking them. Empirically, we show that FlexMDMs match MDMs in perplexity while modeling length statistics with much higher fidelity. On a synthetic maze planning task, they achieve $\approx 60\%$ higher success rate than MDM baselines. Finally, we show pretrained MDMs can easily be retrofitted into FlexMDMs: on 16 H100s, it takes only three days to fine-tune LLaDA-8B into a FlexMDM, achieving superior performance on math (GSM8K, $58\% \to 67\%$) and code infilling ($52\% \to 65\%$).
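To make the insert-then-unmask process concrete, here is a minimal, self-contained sketch of a flexible-length sampling loop in the spirit of the abstract. It is not the authors' algorithm or code: `toy_insert_prob` and `toy_unmask_dist` are hypothetical stand-ins for the learned insertion rate and token posterior, and the fixed per-step probabilities are purely illustrative.

```python
import random

MASK = "<mask>"
VOCAB = ["a", "b", "c", "<eos>"]


def toy_insert_prob(seq, gap_index):
    """Hypothetical stand-in for a learned insertion rate at a given gap."""
    return 0.3 if len(seq) < 8 else 0.0


def toy_unmask_dist(seq, pos):
    """Hypothetical stand-in for a learned token posterior at a masked slot."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def flexible_length_sample(steps=10, seed=0):
    """Sketch of flexible-length masked diffusion sampling: each step first
    inserts new MASK tokens at sampled gaps (growing the sequence), then
    reveals some existing MASK tokens by sampling from a token posterior."""
    rng = random.Random(seed)
    seq = [MASK]  # start from a single masked slot
    for _ in range(steps):
        # insertion phase: visit every gap, including both ends
        grown = []
        for i in range(len(seq) + 1):
            if rng.random() < toy_insert_prob(seq, i):
                grown.append(MASK)
            if i < len(seq):
                grown.append(seq[i])
        seq = grown
        # unmasking phase: reveal each masked slot with probability 0.5
        for pos, tok in enumerate(seq):
            if tok == MASK and rng.random() < 0.5:
                dist = toy_unmask_dist(seq, pos)
                tokens, weights = zip(*dist.items())
                seq[pos] = rng.choices(tokens, weights=weights, k=1)[0]
    return seq


if __name__ == "__main__":
    print(flexible_length_sample())
```

In a trained FlexMDM-style model, both the where-to-insert and what-to-unmask decisions would come from the network rather than the fixed toy probabilities used here.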
Related papers
- DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking [13.905201743303214]
Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and then predicting tokens at those positions. Yet MDMs lack proper perplexity evaluation: the ELBO is a loose bound on likelihood under the training distribution, not the test-time distribution. We introduce the DUEL framework, which formalizes deterministic position selection, unifying leading MDM sampling strategies.
arXiv Detail & Related papers (2026-03-02T01:56:03Z) - Unifying Masked Diffusion Models with Various Generation Orders and Beyond [56.70289720766803]
Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation. We propose the order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes. We introduce the learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone.
arXiv Detail & Related papers (2026-02-02T13:54:32Z) - Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model [74.99242687133408]
Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. We introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule.
arXiv Detail & Related papers (2025-12-25T12:06:04Z) - Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models [63.50827603618498]
We propose Sparse-LaViDa, a modeling framework that truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks.
arXiv Detail & Related papers (2025-12-16T02:06:06Z) - Sequential Diffusion Language Models [110.06562906987052]
Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value caches. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction. We propose the Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost.
arXiv Detail & Related papers (2025-09-28T17:59:15Z) - Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding [53.82301522384719]
We propose Dimple, the first Discrete Multimodal Large Language Model (DMLLM). We design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. Dimple-7B surpasses LLaVA- in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models.
arXiv Detail & Related papers (2025-05-22T17:55:04Z) - Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions [32.48588058887852]
Insertion Language Models (ILMs) learn to insert tokens at arbitrary positions in a sequence. ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences.
arXiv Detail & Related papers (2025-05-09T03:29:15Z) - Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions [14.85882273040068]
Masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. We show that adaptive inference can boost solving accuracy in pretrained MDMs from $7\%$ to $\approx 90\%$, even outperforming ARMs with $7\times$ as many parameters (a minimal sketch of the confidence-based unmasking this relies on appears after this list).
arXiv Detail & Related papers (2025-02-10T18:47:21Z) - Enabling Autoregressive Models to Fill In Masked Tokens [50.9948753314669]
This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that achieves state-of-the-art masked infilling performance. MARIA combines a pre-trained masked language model and an AR model by training a linear decoder that takes their hidden states as input. Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
arXiv Detail & Related papers (2025-02-09T20:02:05Z) - Scaling up Masked Diffusion Models on Text [43.16800764711572]
Masked diffusion models (MDMs) have shown promise in language modeling. This paper establishes the first scaling law for MDMs. We train a family of MDMs with up to 1.1 billion (B) parameters to evaluate their performance against larger sizes.
arXiv Detail & Related papers (2024-10-24T08:01:22Z) - Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning [89.96284387376119]
We show how diffusion models learn difficult subgoals that elude autoregressive approaches. We propose Multi-Granularity Diffusion Modeling (MGDM), which prioritizes subgoals based on difficulty during learning. MGDM significantly outperforms autoregressive models without using search techniques.
arXiv Detail & Related papers (2024-10-18T03:48:53Z)
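Two of the entries above, DUEL and "Train for the Worst, Plan for the Best", turn on how an MDM chooses which masked positions to reveal. Below is a minimal sketch of one common such rule, greedy confidence-based selection: repeatedly commit the argmax token at whichever masked position the model assigns the highest top-token probability. `toy_posteriors` is a hypothetical stand-in for a trained model's per-position token posteriors; neither paper's exact procedure is reproduced here.

```python
MASK = "<mask>"


def toy_posteriors(seq):
    """Hypothetical stand-in for a masked diffusion model's per-position token
    posteriors; returns {position: {token: probability}} for every masked slot."""
    vocab = ["x", "y", "z"]
    posteriors = {}
    for pos, tok in enumerate(seq):
        if tok == MASK:
            # make later positions artificially more confident so the
            # selection order is visible in this toy example
            peak = min(0.9, 0.4 + 0.1 * pos)
            rest = (1.0 - peak) / (len(vocab) - 1)
            posteriors[pos] = {vocab[0]: peak, vocab[1]: rest, vocab[2]: rest}
    return posteriors


def greedy_confidence_decode(seq):
    """Deterministic position selection: repeatedly unmask the single position
    whose most likely token has the highest probability, committing that token."""
    seq = list(seq)
    while MASK in seq:
        posteriors = toy_posteriors(seq)
        # choose the masked position with the most confident prediction
        pos = max(posteriors, key=lambda p: max(posteriors[p].values()))
        seq[pos] = max(posteriors[pos], key=posteriors[pos].get)
    return seq


if __name__ == "__main__":
    print(greedy_confidence_decode([MASK, "a", MASK, MASK]))
```

Low-confidence remasking, top-k parallel unmasking, and other schedules studied in these papers can be viewed as variations on which positions this loop commits at each step.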