Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions
- URL: http://arxiv.org/abs/2502.06768v2
- Date: Wed, 05 Mar 2025 19:19:48 GMT
- Title: Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions
- Authors: Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, Sitan Chen
- Abstract summary: Masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. We show that adaptive inference can boost solving accuracy in pretrained MDMs from $<7\%$ to $\approx 90\%$, even outperforming ARMs with $7\times$ as many parameters.
- Score: 14.85882273040068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work, we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from $<7\%$ to $\approx 90\%$, even outperforming ARMs with $7\times$ as many parameters and that were explicitly trained via teacher forcing to learn the right order of decoding.
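The adaptive order-selection idea from the abstract can be illustrated as greedy confidence-based unmasking: at each step, query the model for distributions over all masked positions and commit the one it is most certain about. The sketch below is a minimal toy version, assuming a `predict` interface that returns per-position token distributions; the interface, the entropy criterion, and the toy model are illustrative, not the paper's actual implementation.

```python
import math

MASK = None  # placeholder value for a masked position

def entropy(p):
    """Shannon entropy of a probability distribution (list of floats)."""
    return -sum(q * math.log(q) for q in p if q > 0)

def adaptive_decode(predict, seq):
    """Greedy adaptive unmasking: repeatedly commit the masked position
    whose predicted distribution has the lowest entropy (highest
    confidence), filling it with the argmax token.
    `predict(seq)` must return {pos: [p_tok0, p_tok1, ...]} for every
    masked position -- an assumed interface for this sketch."""
    seq = list(seq)
    while MASK in seq:
        dists = predict(seq)
        pos = min(dists, key=lambda i: entropy(dists[i]))  # most confident slot
        probs = dists[pos]
        seq[pos] = max(range(len(probs)), key=probs.__getitem__)
    return seq

# Toy model: position i "knows" its token is i % 3 once its left
# neighbor is revealed, and is maximally uncertain otherwise.
def toy_predict(seq):
    out = {}
    for i, tok in enumerate(seq):
        if tok is not MASK:
            continue
        if i == 0 or seq[i - 1] is not MASK:
            p = [0.01, 0.01, 0.01]
            p[i % 3] = 0.98
        else:
            p = [1 / 3, 1 / 3, 1 / 3]
        out[i] = p
    return out

print(adaptive_decode(toy_predict, [MASK] * 6))  # → [0, 1, 2, 0, 1, 2]
```

With this toy model, the entropy rule recovers left-to-right decoding automatically, because that is the order in which positions become confident; on a puzzle like Sudoku the confident positions would instead be the constrained cells, which is what lets the sampler sidestep hard subproblems.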
Related papers
- Learn from Your Mistakes: Self-Correcting Masked Diffusion Models [31.536464269884103]
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models. We propose a framework that trains a model to perform both unmasking and correction. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence.
arXiv Detail & Related papers (2026-02-12T05:17:31Z) - Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training [21.78753228511593]
Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. This flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns. We propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns.
arXiv Detail & Related papers (2026-02-10T21:42:50Z) - Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model [74.99242687133408]
Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. We introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule.
arXiv Detail & Related papers (2025-12-25T12:06:04Z) - MDiff4STR: Mask Diffusion Model for Scene Text Recognition [59.79818820650126]
Mask Diffusion Models (MDMs) have emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks. We show that vanilla MDMs lag behind ARMs in accuracy, although they improve recognition efficiency. We propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for Scene Text Recognition.
arXiv Detail & Related papers (2025-12-01T08:57:51Z) - Masked Diffusion Models are Secretly Learned-Order Autoregressive Models [21.17429712617749]
We show that Masked Diffusion Models can identify and optimize for a decoding order during training. We prove that the MDM objective decomposes precisely into a weighted sum of autoregressive losses over these orders.
arXiv Detail & Related papers (2025-11-24T14:17:56Z) - Any-Order Flexible Length Masked Diffusion [53.89217188409148]
Masked diffusion models (MDMs) have recently emerged as a promising alternative to autoregressive models over discrete domains. We introduce Flexible Masked Diffusion Models (FlexMDMs), a discrete diffusion paradigm that can model sequences of flexible length. We show that FlexMDMs match MDMs in perplexity while modeling length statistics with much higher fidelity.
arXiv Detail & Related papers (2025-08-31T23:34:53Z) - MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models [28.79185891706149]
Diffusion language models suffer from a key discrepancy between training and inference. We propose Masked Diffusion Policy Optimization (MDPO), a novel method that exploits the Markov property of diffusion. Our findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs.
arXiv Detail & Related papers (2025-08-18T17:58:13Z) - ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs [1.1834200163382398]
ReGATE (Reference-Guided Adaptive Token Elision) is an adaptive token pruning method for accelerating MLLM training. It matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 35% of the tokens.
arXiv Detail & Related papers (2025-07-29T01:07:09Z) - Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs). Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z) - From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs [37.50902921493273]
Training large language models (LLMs) for different inference constraints is computationally expensive.
DynaMoE adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost.
Our method achieves similar aggregated accuracy across downstream tasks, despite using only $\frac{1}{9}$th of their fine-tuning cost.
arXiv Detail & Related papers (2025-02-17T21:12:57Z) - Enabling Autoregressive Models to Fill In Masked Tokens [50.9948753314669]
This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that achieves state-of-the-art masked infilling performance.
MARIA combines a pre-trained masked model and an AR model by training a linear decoder that takes their hidden states as input.
Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
arXiv Detail & Related papers (2025-02-09T20:02:05Z) - Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning [89.96284387376119]
We show how diffusion models learn difficult subgoals that elude autoregressive approaches.
We propose Multi-granularity Diffusion Modeling (MDM), which prioritizes subgoals based on difficulty during learning.
On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MDM significantly outperforms autoregressive models without using search techniques.
arXiv Detail & Related papers (2024-10-18T03:48:53Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts [38.15244333975921]
MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training.
Our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
arXiv Detail & Related papers (2024-07-13T09:22:33Z) - DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling [0.0]
We introduce the idea of Mixture-of-Experts (MoE) into the field of reward model (RM) training.
We decompose the specific task into multiple capability dimensions and individually fine-tune a LoRA expert on each one.
Our model attains superior consistency with human preference and outstrips advanced generative approaches.
arXiv Detail & Related papers (2024-03-02T12:31:22Z) - Training Chain-of-Thought via Latent-Variable Inference [30.21067593018967]
Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" prompt.
Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers.
We propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting.
arXiv Detail & Related papers (2023-11-28T17:47:32Z) - CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
arXiv Detail & Related papers (2023-08-31T09:13:30Z) - Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z) - On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting [5.5302127686575435]
Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM).
We show that methods such as KL-control developed for RM can also be construed as belonging to DM.
We leverage connections between the two paradigms to import the concept of baseline into DM methods.
arXiv Detail & Related papers (2022-06-01T20:54:41Z) - KSM: Fast Multiple Task Adaption via Kernel-wise Soft Mask Learning [49.77278179376902]
Deep Neural Networks (DNNs) can forget knowledge of earlier tasks when learning new tasks; this is known as catastrophic forgetting.
Recent continual learning methods are capable of alleviating catastrophic forgetting on toy-sized datasets.
We propose a new training method called Kernel-wise Soft Mask (KSM), which learns a kernel-wise hybrid binary and real-value soft mask for each task.
arXiv Detail & Related papers (2020-09-11T21:48:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.