Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training
- URL: http://arxiv.org/abs/2602.10314v1
- Date: Tue, 10 Feb 2026 21:42:50 GMT
- Title: Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training
- Authors: Jaeyeon Kim, Jonathan Geuter, David Alvarez-Melis, Sham Kakade, Sitan Chen,
- Abstract summary: Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. This flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns. We propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns.
- Score: 21.78753228511593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non-causal tasks. However, this flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train--test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on inference-aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by $\approx 2.5\times$ and offers complementary advantages on top of common recipes like autoregressive initialization. We open-source our codebase at https://github.com/JaeyeonKim01/PUMA.
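To make the train-test mismatch and the proposed fix concrete, below is a minimal sketch (not the authors' implementation) contrasting the standard random forward process with a progressive, nested one in which masks are obtained by truncating a per-sequence unmasking order, so masks at different noise levels lie on a single trajectory of the kind inference-time unmasking actually produces. The `MASK_ID` constant, function names, and the uniform random order are illustrative assumptions; PUMA's actual masking schedule may differ (see the open-sourced codebase).

```python
import math
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token

def random_mdm_mask(tokens: torch.Tensor, t: float) -> torch.Tensor:
    """Standard MDM forward process: each position is masked independently
    with probability t, so training covers exponentially many patterns."""
    hide = torch.rand(tokens.shape, device=tokens.device) < t
    return torch.where(hide, torch.full_like(tokens, MASK_ID), tokens)

def progressive_mask(tokens: torch.Tensor, t: float) -> torch.Tensor:
    """Sketch of a progressive (nested) forward process: sample one
    unmasking order per sequence and hide the tokens that would still be
    masked at noise level t of that order, mirroring inference-time
    unmasking trajectories."""
    B, L = tokens.shape
    order = torch.argsort(torch.rand(B, L, device=tokens.device), dim=1)
    rank = torch.argsort(order, dim=1)   # rank[b, i] = step at which position i is revealed
    n_visible = L - math.ceil(t * L)     # the first n_visible steps of the order stay visible
    hide = rank >= n_visible
    return torch.where(hide, torch.full_like(tokens, MASK_ID), tokens)

if __name__ == "__main__":
    x = torch.randint(1, 100, (2, 16))
    print(progressive_mask(x, t=0.75))   # same 2x16 shape, 12 of 16 positions masked per row
```

The key property of the progressive variant is that increasing t only adds masked positions along one trajectory, rather than resampling an unrelated pattern, which is the structure that step-by-step unmasking at inference time induces.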
Related papers
- MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models [28.79185891706149]
Diffusion language models suffer from a key discrepancy between training and inference. We propose a novel Masked Diffusion Policy Optimization (MDPO) to exploit the Markov property of the diffusion process. Our findings highlight the potential of investigating the discrepancy between pre-training and inference of MDLMs.
arXiv Detail & Related papers (2025-08-18T17:58:13Z) - Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions [14.85882273040068]
Masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. We show that adaptive inference can boost solving accuracy in pretrained MDMs from $7\%$ to $\approx 90\%$, even outperforming ARMs with $7\times$ as many parameters (a minimal sketch of such adaptive unmasking appears after this list).
arXiv Detail & Related papers (2025-02-10T18:47:21Z) - ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z) - CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
arXiv Detail & Related papers (2023-08-31T09:13:30Z) - Masking meets Supervision: A Strong Learning Alliance [45.04910405404371]
We propose a novel way to involve masking augmentations, dubbed Masked Sub-branch (MaskSub). The main branch undergoes conventional training recipes, while the sub-branch receives intensive masking augmentations during training. MaskSub mitigates the adverse effects of heavy masking through a relaxed loss function similar to a self-distillation loss.
arXiv Detail & Related papers (2023-06-20T07:17:38Z) - Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z) - Difference-Masking: Choosing What to Mask in Continued Pretraining [56.76782116221438]
We introduce Difference-Masking, a masking strategy that automatically chooses what to mask during continued pretraining.
We find that Difference-Masking outperforms baselines on continued pretraining settings across four diverse language-only and multimodal video tasks.
arXiv Detail & Related papers (2023-05-23T23:31:02Z) - Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z) - Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks [28.498176073737422]
Recently, researchers proposed pruning the weights of deep neural networks (DNNs) using an $N:M$ fine-grained block sparsity mask.
We propose a novel transposable-fine-grained sparsity mask where the same mask can be used for both forward and backward passes.
Our experiments suggest a 2x speed-up with no accuracy degradation on vision and language models.
arXiv Detail & Related papers (2021-02-16T12:44:16Z) - Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
arXiv Detail & Related papers (2020-10-12T21:28:14Z) - Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.
arXiv Detail & Related papers (2020-01-23T19:56:35Z)
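As referenced in the "Train for the Worst, Plan for the Best" entry above, "adaptive inference" for MDMs typically means choosing which positions to reveal based on the model's own predictions. Below is a minimal sketch of one common form, confidence-based unmasking, where the most confident masked positions are committed at each step. The `model(x)` interface, `MASK_ID`, and the per-step budget are illustrative assumptions; the exact adaptive strategy studied in that paper may differ.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token

@torch.no_grad()
def adaptive_unmask(model, x: torch.Tensor, tokens_per_step: int = 4) -> torch.Tensor:
    """Confidence-based adaptive decoding for a masked diffusion model:
    at each step, score every masked position and commit only the
    highest-confidence predictions, instead of revealing positions in a
    fixed or random order. Assumes model(x) returns logits of shape (B, L, V)."""
    x = x.clone()
    while (x == MASK_ID).any():
        masked = x == MASK_ID                              # (B, L) still-hidden positions
        probs = model(x).softmax(dim=-1)                   # (B, L, V)
        conf, pred = probs.max(dim=-1)                     # per-position confidence and argmax token
        conf = conf.masked_fill(~masked, float("-inf"))    # only masked positions compete
        k = max(1, min(tokens_per_step, int(masked.sum(dim=1).min())))
        top = conf.topk(k, dim=1).indices                  # (B, k) most confident slots
        commit = torch.zeros_like(masked)
        commit.scatter_(1, top, True)
        commit &= masked                                   # never overwrite revealed tokens
        x = torch.where(commit, pred, x)
    return x
```

The decoding order this produces is exactly the kind of highly structured masking pattern that random training-time masks rarely cover, which is the mismatch PUMA is designed to address.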
This list is automatically generated from the titles and abstracts of the papers on this site.