Related papers: Discrete Stochastic Localization for Non-autoregressive Generation

URL: http://arxiv.org/abs/2602.16169v1
Date: Wed, 18 Feb 2026 04:05:40 GMT
Title: Discrete Stochastic Localization for Non-autoregressive Generation
Authors: Yunshu Wu, Jiayi Cheng, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg
Abstract summary: We show that training alone can substantially improve the step-efficiency of MDLM/ReMDM sampling. On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.
Score: 17.56505846228918
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Non-autoregressive (NAR) generation reduces decoding latency by predicting many tokens in parallel, but iterative refinement often suffers from error accumulation and distribution shift under self-generated drafts. Masked diffusion language models (MDLMs) and their remasking samplers (e.g., ReMDM) can be viewed as modern NAR iterative refinement, where generation repeatedly revises a partially observed draft. In this work we show that training alone can substantially improve the step-efficiency of MDLM/ReMDM sampling. We propose DSL (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with ~4× fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.
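
The iterative refinement described in the abstract (fill masked positions in parallel, then re-mask and revise part of the draft) can be illustrated with a toy loop. This is only a sketch under loud assumptions: `toy_denoiser`, `refine`, `MASK`, and `remask_frac` are hypothetical names, the denoiser is a random stand-in rather than a trained model, and the lowest-confidence remasking rule is a simplification, not the ReMDM sampler or the DSL training method.

```python
import random

MASK = -1  # hypothetical mask token id, chosen for illustration only


def toy_denoiser(draft, vocab_size, rng):
    """Random stand-in for a trained denoiser (assumption, not a real model).

    Returns one (token, confidence) pair per position of the draft.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in draft]


def refine(length, vocab_size, steps, remask_frac=0.25, seed=0):
    """Toy mask-predict style refinement with confidence-based remasking."""
    rng = random.Random(seed)
    draft = [MASK] * length   # start from a fully masked draft
    conf = [0.0] * length     # confidence recorded when each token was filled
    for step in range(steps):
        preds = toy_denoiser(draft, vocab_size, rng)
        # Fill every currently masked position in parallel.
        for i, (tok, c) in enumerate(preds):
            if draft[i] == MASK:
                draft[i], conf[i] = tok, c
        # On all but the last step, re-mask the least confident positions
        # so that later steps get a chance to revise them.
        if step < steps - 1:
            k = int(remask_frac * length)
            for i in sorted(range(length), key=lambda j: conf[j])[:k]:
                draft[i] = MASK
    return draft


sample = refine(length=16, vocab_size=50, steps=4)
```

Each loop iteration corresponds to one denoiser evaluation, so `steps` plays the role of the step budget mentioned in the abstract; a better-calibrated confidence signal would let fewer steps suffice, which is the kind of efficiency the paper reports.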

Related papers

Learn from Your Mistakes: Self-Correcting Masked Diffusion Models [31.536464269884103]
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models. We propose a framework that trains a model to perform both unmasking and correction. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence.
arXiv Detail & Related papers (2026-02-12T05:17:31Z)
Training-Free Self-Correction for Multimodal Masked Diffusion Models [61.84305395626145]
We propose a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models. Our method significantly improves generation quality on text-to-image generation and multimodal understanding tasks with reduced sampling steps.
arXiv Detail & Related papers (2026-02-02T23:58:15Z)
Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models [58.946955321428845]
This work presents self-rewarding sequential Monte Carlo (SMC). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy. We introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights.
arXiv Detail & Related papers (2026-02-02T09:21:45Z)
Teach Diffusion Language Models to Learn from Their Own Mistakes [45.68746718883178]
Masked Diffusion Language Models (DLMs) achieve significant speed by generating multiple tokens in parallel. The parallel sampling approach introduces strong dependency errors and causes quality to deteriorate rapidly as the generation step size grows. We propose Decoupled Self-Correction to maintain high-quality multi-token generation.
arXiv Detail & Related papers (2026-01-10T05:04:33Z)
Learning Unmasking Policies for Diffusion Language Models [33.44995119635116]
Diffusion language models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. In this work, we propose to train sampling procedures using reinforcement learning.
arXiv Detail & Related papers (2025-12-09T20:44:33Z)
Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z)
Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall [28.243098541421755]
We introduce Loopholing, a novel and simple mechanism that preserves information via a deterministic latent pathway. LDDMs achieve substantial gains, reducing generative perplexity by up to 61% over prior baselines. Results also indicate that loopholing mitigates idle steps and oscillations, providing a scalable path toward high-quality non-autoregressive text generation.
arXiv Detail & Related papers (2025-10-22T07:08:47Z)
Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z)
Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
Mitigating the Learning Bias towards Repetition by Self-Contrastive Training for Open-Ended Generation [92.42032403795879]
We show that pretrained language models (LMs) such as GPT2 still tend to generate repetitive texts. We attribute their overestimation of token-level repetition probabilities to the learning bias. We find that LMs use longer-range dependencies to predict repetitive tokens than non-repetitive ones, which may be the cause of sentence-level repetition loops.
arXiv Detail & Related papers (2023-07-04T07:53:55Z)
An EM Approach to Non-autoregressive Conditional Sequence Generation [49.11858479436565]
Autoregressive (AR) models have been the dominant approach to conditional sequence generation. Non-autoregressive (NAR) models have recently been proposed to reduce latency by generating all output tokens in parallel. This paper proposes a new approach that jointly optimizes both AR and NAR models in a unified Expectation-Maximization framework.
arXiv Detail & Related papers (2020-06-29T20:58:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.