Learn from Your Mistakes: Self-Correcting Masked Diffusion Models
- URL: http://arxiv.org/abs/2602.11590v1
- Date: Thu, 12 Feb 2026 05:17:31 GMT
- Title: Learn from Your Mistakes: Self-Correcting Masked Diffusion Models
- Authors: Yair Schiff, Omer Belhasin, Roy Uziel, Guanghan Wang, Marianne Arriola, Gilad Turok, Michael Elad, Volodymyr Kuleshov,
- Abstract summary: Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models. We propose a framework that trains a model to perform both unmasking and correction. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence.
- Score: 31.536464269884103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models, enabling parallel token generation while achieving competitive performance. Despite these advantages, MDMs face a fundamental limitation: once tokens are unmasked, they remain fixed, leading to error accumulation and ultimately degrading sample quality. We address this by proposing a framework that trains a model to perform both unmasking and correction. By reusing outputs from the MDM denoising network as inputs for corrector training, we train a model to recover from potential mistakes. During generation we apply additional corrective refinement steps between unmasking ones in order to change decoded tokens and improve outputs. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence, including already generated tokens. We conduct extensive experimental validation across multiple conditional and unconditional tasks, demonstrating that ProSeCo yields better quality-efficiency trade-offs (up to ~2-3x faster sampling) and enables inference-time compute scaling to further increase sample quality beyond standard MDMs (up to ~1.3x improvement on benchmarks).
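The abstract describes a sampling procedure that alternates standard unmasking steps with corrective refinement steps able to overwrite already-decoded tokens. Below is a minimal sketch of such a loop, not the authors' implementation: the `model` callable, the `MASK_ID` constant, and the confidence-based decoding heuristic are illustrative assumptions.

```python
# Minimal sketch of a ProSeCo-style sampling loop (illustrative, not the paper's code).
# Assumes `model(tokens)` returns per-position token logits for a masked diffusion
# model trained to both unmask masked positions and correct decoded ones.
import torch

MASK_ID = 0  # hypothetical id of the special [MASK] token


def sample_with_correction(model, seq_len, num_unmask_steps=16,
                           correct_steps_per_unmask=1, device="cpu"):
    """Interleave unmasking steps with corrective refinement steps.

    Unmasking steps decode a fraction of the remaining masked positions;
    corrective steps re-predict already-decoded positions and may overwrite
    them, so earlier mistakes are not frozen into the final sample.
    """
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long, device=device)

    for step in range(num_unmask_steps):
        # --- unmasking step: fill in some still-masked positions ---
        logits = model(tokens)                      # (1, seq_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        masked = tokens.eq(MASK_ID)
        # decode the most confident masked positions for this step
        k = max(1, int(masked.sum().item() / (num_unmask_steps - step)))
        conf_masked = torch.where(masked, conf, torch.full_like(conf, -1.0))
        idx = conf_masked.topk(k, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))

        # --- corrective refinement step(s): revisit decoded tokens ---
        for _ in range(correct_steps_per_unmask):
            pred = model(tokens).argmax(dim=-1)
            decoded = ~tokens.eq(MASK_ID)
            # overwrite decoded positions where the corrector disagrees
            tokens = torch.where(decoded, pred, tokens)

    return tokens
```

In this sketch the same network serves as both denoiser and corrector, mirroring the abstract's description of a single model trained to perform unmasking and correction; the number of corrective steps per unmasking step is the knob that trades extra inference-time compute for sample quality.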
Related papers
- Discrete Stochastic Localization for Non-autoregressive Generation [17.56505846228918]
We show that training alone can substantially improve the step-efficiency of MDLM/ReMDM sampling. On OpenWebText, fine-tuning yields large MAUVE gains at low step budgets, surpassing MDLM+ReMDM. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.
arXiv Detail & Related papers (2026-02-18T04:05:40Z) - Training-Free Self-Correction for Multimodal Masked Diffusion Models [61.84305395626145]
We propose a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models. Our method significantly improves generation quality on text-to-image generation and multimodal understanding tasks with reduced sampling steps.
arXiv Detail & Related papers (2026-02-02T23:58:15Z) - Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models [58.946955321428845]
This work presents self-rewarding sequential Monte Carlo (SMC). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy. We introduce trajectory-level confidence as a self-rewarding signal for assigning particle importance weights.
arXiv Detail & Related papers (2026-02-02T09:21:45Z) - Teach Diffusion Language Models to Learn from Their Own Mistakes [45.68746718883178]
Masked Diffusion Language Models (DLMs) achieve significant speedups by generating multiple tokens in parallel. However, this parallel sampling approach introduces strong dependency errors and causes quality to deteriorate rapidly as the generation step size grows. We propose Decoupled Self-Correction to maintain high-quality multi-token generation.
arXiv Detail & Related papers (2026-01-10T05:04:33Z) - Learning Unmasking Policies for Diffusion Language Models [33.44995119635116]
Diffusion Large Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. In this work, we propose to train sampling procedures using reinforcement learning.
arXiv Detail & Related papers (2025-12-09T20:44:33Z) - Fine-Tuning Masked Diffusion for Provable Self-Correction [28.338622227684453]
Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. We introduce PRISM (Plug-in Remasking for Inference-time Self-correction of Masked Diffusions).
arXiv Detail & Related papers (2025-10-01T19:15:25Z) - Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z) - Text Generation Beyond Discrete Token Sampling [74.06071135207635]
Mixture of Inputs (MoI) is a training-free method for autoregressive generation. MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B.
arXiv Detail & Related papers (2025-05-20T18:41:46Z) - Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [55.12082817901671]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT). MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting.
arXiv Detail & Related papers (2023-06-12T18:12:19Z) - Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.
arXiv Detail & Related papers (2020-01-23T19:56:35Z)