Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training
- URL: http://arxiv.org/abs/2602.10314v1
- Date: Tue, 10 Feb 2026 21:42:50 GMT
- Title: Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training
- Authors: Jaeyeon Kim, Jonathan Geuter, David Alvarez-Melis, Sham Kakade, Sitan Chen,
- Abstract summary: Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. This flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns. We propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns.
- Score: 21.78753228511593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non-causal tasks. However, this flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train--test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on inference-aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by $\approx 2.5\times$ and offers complementary advantages on top of common recipes like autoregressive initialization. We open-source our codebase at https://github.com/JaeyeonKim01/PUMA.
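To make the train-test mismatch and the proposed fix concrete, below is a minimal sketch (not the authors' implementation) contrasting the standard random forward process with a progressive, nested one in which masks are obtained by truncating a per-sequence unmasking order, so masks at different noise levels lie on a single trajectory of the kind inference-time unmasking actually produces. The `MASK_ID` constant, function names, and the uniform random order are illustrative assumptions; PUMA's actual masking schedule may differ (see the open-sourced codebase).

```python
import math
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token

def random_mdm_mask(tokens: torch.Tensor, t: float) -> torch.Tensor:
    """Standard MDM forward process: each position is masked independently
    with probability t, so training covers exponentially many patterns."""
    hide = torch.rand(tokens.shape, device=tokens.device) < t
    return torch.where(hide, torch.full_like(tokens, MASK_ID), tokens)

def progressive_mask(tokens: torch.Tensor, t: float) -> torch.Tensor:
    """Sketch of a progressive (nested) forward process: sample one
    unmasking order per sequence and hide the tokens that would still be
    masked at noise level t of that order, mirroring inference-time
    unmasking trajectories."""
    B, L = tokens.shape
    order = torch.argsort(torch.rand(B, L, device=tokens.device), dim=1)
    rank = torch.argsort(order, dim=1)   # rank[b, i] = step at which position i is revealed
    n_visible = L - math.ceil(t * L)     # the first n_visible steps of the order stay visible
    hide = rank >= n_visible
    return torch.where(hide, torch.full_like(tokens, MASK_ID), tokens)

if __name__ == "__main__":
    x = torch.randint(1, 100, (2, 16))
    print(progressive_mask(x, t=0.75))   # same 2x16 shape, 12 of 16 positions masked per row
```

The key property of the progressive variant is that increasing t only adds masked positions along one trajectory, rather than resampling an unrelated pattern, which is the structure that step-by-step unmasking at inference time induces.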
Related papers
- MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models [28.79185891706149]
Diffusion language models suffer from a key discrepancy between training and inference. We propose a novel Masked Diffusion Policy Optimization (MDPO) to exploit the Markov property of the diffusion process. Our findings highlight the potential of investigating the discrepancy between pre-training and inference of MDLMs.
arXiv Detail & Related papers (2025-08-18T17:58:13Z) - Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions [14.85882273040068]
Masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. We show that adaptive inference can boost solving accuracy in pretrained MDMs from $7\%$ to $\approx 90\%$, even outperforming ARMs with $7\times$ as many parameters (a minimal sketch of such adaptive unmasking appears after this list).
arXiv Detail & Related papers (2025-02-10T18:47:21Z) - ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z) - CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
arXiv Detail & Related papers (2023-08-31T09:13:30Z) - Masking meets Supervision: A Strong Learning Alliance [45.04910405404371]
We propose a novel way to involve masking augmentations, dubbed Masked Sub-branch (MaskSub). The main branch undergoes conventional training recipes, while the sub-branch receives intensive masking augmentations during training. MaskSub mitigates the adverse effects of heavy masking through a relaxed loss function similar to a self-distillation loss.
arXiv Detail & Related papers (2023-06-20T07:17:38Z) - Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z) - Difference-Masking: Choosing What to Mask in Continued Pretraining [56.76782116221438]
We introduce Difference-Masking, a masking strategy that automatically chooses what to mask during continued pretraining.
We find that Difference-Masking outperforms baselines on continued pretraining settings across four diverse language-only and multimodal video tasks.
arXiv Detail & Related papers (2023-05-23T23:31:02Z) - Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z) - Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks [28.498176073737422]
Recently, researchers proposed pruning the weights of deep neural networks (DNNs) using an $N:M$ fine-grained block sparsity mask.
We propose a novel transposable-fine-grained sparsity mask where the same mask can be used for both forward and backward passes.
Our experiments suggest a 2x speed-up with no accuracy degradation on vision and language models.
arXiv Detail & Related papers (2021-02-16T12:44:16Z) - Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
arXiv Detail & Related papers (2020-10-12T21:28:14Z) - Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.
arXiv Detail & Related papers (2020-01-23T19:56:35Z)
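As referenced in the "Train for the Worst, Plan for the Best" entry above, "adaptive inference" for MDMs typically means choosing which positions to reveal based on the model's own predictions. Below is a minimal sketch of one common form, confidence-based unmasking, where the most confident masked positions are committed at each step. The `model(x)` interface, `MASK_ID`, and the per-step budget are illustrative assumptions; the exact adaptive strategy studied in that paper may differ.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token

@torch.no_grad()
def adaptive_unmask(model, x: torch.Tensor, tokens_per_step: int = 4) -> torch.Tensor:
    """Confidence-based adaptive decoding for a masked diffusion model:
    at each step, score every masked position and commit only the
    highest-confidence predictions, instead of revealing positions in a
    fixed or random order. Assumes model(x) returns logits of shape (B, L, V)."""
    x = x.clone()
    while (x == MASK_ID).any():
        masked = x == MASK_ID                              # (B, L) still-hidden positions
        probs = model(x).softmax(dim=-1)                   # (B, L, V)
        conf, pred = probs.max(dim=-1)                     # per-position confidence and argmax token
        conf = conf.masked_fill(~masked, float("-inf"))    # only masked positions compete
        k = max(1, min(tokens_per_step, int(masked.sum(dim=1).min())))
        top = conf.topk(k, dim=1).indices                  # (B, k) most confident slots
        commit = torch.zeros_like(masked)
        commit.scatter_(1, top, True)
        commit &= masked                                   # never overwrite revealed tokens
        x = torch.where(commit, pred, x)
    return x
```

The decoding order this produces is exactly the kind of highly structured masking pattern that random training-time masks rarely cover, which is the mismatch PUMA is designed to address.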
This list is automatically generated from the titles and abstracts of the papers on this site.