Masked Diffusion Models are Secretly Learned-Order Autoregressive Models
- URL: http://arxiv.org/abs/2511.19152v1
- Date: Mon, 24 Nov 2025 14:17:56 GMT
- Title: Masked Diffusion Models are Secretly Learned-Order Autoregressive Models
- Authors: Prateek Garg, Bhavya Kohli, Sunita Sarawagi
- Abstract summary: We show that Masked Diffusion Models can identify and optimize for a decoding order during training. We prove that the MDM objective decomposes precisely into a weighted combination of auto-regressive losses over these orders.
- Score: 21.17429712617749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Diffusion Models (MDMs) have emerged as one of the most promising paradigms for generative modeling over discrete domains. It is known that MDMs effectively train to decode tokens in a random order, and that this ordering has significant performance implications in practice. This observation raises a fundamental question: can we design a training framework that optimizes for a favorable decoding order? We answer this in the affirmative, showing that the continuous-time variational objective of MDMs, when equipped with multivariate noise schedules, can identify and optimize for a decoding order during training. We establish a direct correspondence between decoding order and the multivariate noise schedule and show that this setting breaks the invariance of the MDM objective to the noise schedule. Furthermore, we prove that the MDM objective decomposes precisely into a weighted combination of auto-regressive losses over these orders, which establishes MDMs as auto-regressive models with learnable orders.
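To make the central claim concrete, here is a schematic rendering of the two objectives in our own shorthand (per-token schedule \(\alpha_t^{(i)}\), order weights \(w(\sigma)\) over permutations \(\sigma\)); the paper's exact weighting and normalization may differ.

```latex
% Continuous-time MDM objective with a multivariate (per-token) noise
% schedule \alpha_t^{(i)}; x_t is the partially masked sequence at time t.
\mathcal{L}_{\mathrm{MDM}}(\theta)
  = \int_0^1 \sum_{i=1}^{L}
    \frac{\dot{\alpha}_t^{(i)}}{1 - \alpha_t^{(i)}}\,
    \mathbb{E}_{q_t}\!\left[ \mathbf{1}\{x_t^{i} = \texttt{[MASK]}\}\,
      \log p_\theta\!\left(x^{i} \mid x_t\right) \right] dt

% Claimed decomposition: a weighted combination of order-conditional
% auto-regressive losses, with weights w(\sigma) over decoding orders
% \sigma induced by the per-token schedules.
\mathcal{L}_{\mathrm{MDM}}(\theta)
  = \sum_{\sigma} w(\sigma) \sum_{k=1}^{L}
    -\log p_\theta\!\left(x_{\sigma(k)} \mid x_{\sigma(<k)}\right)
```

With a univariate schedule, every order receives the same weight, recovering the familiar random-order reading of MDMs; making the schedule multivariate is what lets training shift mass toward favorable orders.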
Related papers
- Learn from Your Mistakes: Self-Correcting Masked Diffusion Models [31.536464269884103]
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models. We propose a framework that trains a model to perform both unmasking and correction. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence.
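A minimal sketch of what an unmask-then-correct sampler could look like, assuming a hypothetical `model(x) -> logits` interface; ProSeCo's actual scheduling and scoring rules are specified in the paper, not here.

```python
import torch

@torch.no_grad()
def unmask_then_correct(model, seq_len, mask_id, steps=16, n_correct=1):
    """Schematic sampler: commit confident tokens while masks remain,
    then remask and re-predict the least confident committed tokens."""
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(x)                      # (1, L, V); hypothetical interface
        conf, pred = logits.softmax(-1).max(-1)
        masked = x == mask_id
        if masked.any():
            # Unmasking pass: commit the most confident masked positions.
            k = max(1, masked.sum().item() // (steps - step))
            scores = torch.where(masked, conf, torch.full_like(conf, -1.0))
            idx = scores.topk(k, dim=-1).indices
            x.scatter_(1, idx, pred.gather(1, idx))
        else:
            # Correction pass: remask the least confident tokens, re-predict.
            idx = conf.topk(n_correct, dim=-1, largest=False).indices
            x.scatter_(1, idx, mask_id)
            pred = model(x).argmax(-1)
            x.scatter_(1, idx, pred.gather(1, idx))
    return x
```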
arXiv Detail & Related papers (2026-02-12T05:17:31Z)
- Unifying Masked Diffusion Models with Various Generation Orders and Beyond [56.70289720766803]
Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation. We propose the order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes. We introduce the learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and the diffusion backbone.
arXiv Detail & Related papers (2026-02-02T13:54:32Z)
- Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model [74.99242687133408]
Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. We introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule.
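For reference, the group-relative baseline that gives GRPO its name can be sketched in a few lines; how Co-GRPO defines the MDP state and action space over both unmasking decisions and the inference schedule is the paper's contribution and is not reproduced here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Generic GRPO mechanics: rewards has shape (num_prompts, group_size);
    each sampled trajectory is scored against its own group's mean and std."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

Per the abstract, each trajectory in a group covers both the model's token predictions and the schedule choices, so a single advantage signal co-optimizes both.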
arXiv Detail & Related papers (2025-12-25T12:06:04Z)
- MDiff4STR: Mask Diffusion Model for Scene Text Recognition [59.79818820650126]
Mask Diffusion Models (MDMs) have emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks. We show that the vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. We propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for Scene Text Recognition.
arXiv Detail & Related papers (2025-12-01T08:57:51Z)
- Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling [48.96034602889216]
Variational Autoencoding Discrete Diffusion (VADD) is a novel framework that enhances discrete diffusion with latent variable modeling. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bounds and amortized inference over the training set. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines.
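A minimal sketch of a variational bound with an auxiliary recognition model, assuming hypothetical `recog` and `denoiser` interfaces; VADD's actual parameterization and bound follow the paper.

```python
import torch
import torch.nn.functional as F

def vadd_style_elbo(recog, denoiser, x, x_masked):
    """Schematic negative ELBO: reconstruct clean tokens x from a masked
    copy, conditioning the denoiser on a latent z ~ q(z | x)."""
    mu, logvar = recog(x)                                  # q(z | x); hypothetical
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
    logits = denoiser(x_masked, z)                         # (B, L, V); hypothetical
    recon = F.cross_entropy(logits.transpose(1, 2), x)     # -E_q[log p(x | x_masked, z)]
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
    return recon + kl                                      # minimize
```

The shared latent gives the denoiser one source of randomness across all positions, which is one way to capture the cross-dimensional correlations the title refers to.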
arXiv Detail & Related papers (2025-05-23T01:45:47Z)
- Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions [32.48588058887852]
Insertion Language Models (ILMs) learn to insert tokens at arbitrary positions in a sequence. ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences.
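A toy generation loop for the insertion idea, assuming a hypothetical model that scores the len(seq)+1 insertion slots and a token per slot; the real ILM training and decoding procedure is in the paper.

```python
import torch

@torch.no_grad()
def insertion_generate(model, start_tokens, stop_id, max_steps=64):
    """Schematic arbitrary-position insertion: repeatedly pick a slot,
    pick a token for it, and insert, until the model elects to stop."""
    seq = list(start_tokens)
    for _ in range(max_steps):
        slot_logits, tok_logits = model(torch.tensor([seq]))  # hypothetical interface
        slot = int(slot_logits[0].argmax())       # which of the len(seq)+1 gaps
        tok = int(tok_logits[0, slot].argmax())   # which token to place there
        if tok == stop_id:                        # designated "no insertion" token
            break
        seq.insert(slot, tok)
    return seq
```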
arXiv Detail & Related papers (2025-05-09T03:29:15Z)
- Enabling Autoregressive Models to Fill In Masked Tokens [50.9948753314669]
This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that achieves state-of-the-art masked infilling performance. MARIA combines a pre-trained masked language model (MLM) and an AR model by training a linear decoder that takes their hidden states as input. Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
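The abstract is explicit that the trainable part is a linear decoder over the two frozen models' hidden states; a sketch of such a head follows (concatenation is our assumption for how the states are combined, and the dimensions are placeholders).

```python
import torch
import torch.nn as nn

class LinearInfillHead(nn.Module):
    """Linear decoder over masked-LM and AR hidden states, per the
    abstract's description; the concat combination is our guess."""
    def __init__(self, d_mlm: int, d_ar: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_mlm + d_ar, vocab_size)

    def forward(self, h_mlm: torch.Tensor, h_ar: torch.Tensor) -> torch.Tensor:
        # h_mlm, h_ar: (B, L, d_*) hidden states from the two frozen backbones.
        return self.proj(torch.cat([h_mlm, h_ar], dim=-1))
```

Because only the linear head is trained, both pre-trained backbones can be reused unchanged.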
arXiv Detail & Related papers (2025-02-09T20:02:05Z)
- Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction [88.65168366064061]
We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference.
Our framework leads to a family of three novel objectives that are all simulation-free, and thus scalable.
We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.
arXiv Detail & Related papers (2024-10-10T17:18:30Z)
- Unified Auto-Encoding with Masked Diffusion [15.264296748357157]
We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD).
UMD combines patch-based and noise-based corruption techniques within a single auto-encoding framework.
It achieves strong performance in downstream generative and representation learning tasks.
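One way to picture a corruption that is simultaneously patch-based and noise-based; this toy version is our guess at the idea, not UMD's actual operator or schedule.

```python
import torch

def combined_corrupt(images, t, patch=16, mask_ratio=0.5):
    """Toy corruption: MAE-style patch dropping plus diffusion-style
    Gaussian noising, blended by a time parameter t in [0, 1]."""
    b, c, h, w = images.shape
    keep = torch.rand(b, 1, h // patch, w // patch, device=images.device) >= mask_ratio
    keep = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3).float()
    masked = images * keep                        # patch-based corruption
    noise = torch.randn_like(images)              # noise-based corruption
    return (1.0 - t) * masked + t * noise
```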
arXiv Detail & Related papers (2024-06-25T16:24:34Z)
- SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
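The backtracking in the title amounts to letting the model undo tokens during generation; below is a sketch of a sampling loop with a backspace action, with the model interface and special token ids assumed for illustration.

```python
import torch

@torch.no_grad()
def generate_with_backspace(model, prompt, backspace_id, eos_id, max_steps=128):
    """Schematic sampler where one vocabulary item deletes the previous
    token; SequenceMatch's IL training objective is separate from this."""
    seq = list(prompt)
    for _ in range(max_steps):
        logits = model(torch.tensor([seq]))[0, -1]   # next-token logits; hypothetical
        tok = int(torch.distributions.Categorical(logits=logits).sample())
        if tok == eos_id:
            break
        if tok == backspace_id and len(seq) > len(prompt):
            seq.pop()                                # backtrack: undo last token
        else:
            seq.append(tok)
    return seq
```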
arXiv Detail & Related papers (2023-06-08T17:59:58Z)