Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies
- URL: http://arxiv.org/abs/2510.05725v1
- Date: Tue, 07 Oct 2025 09:44:24 GMT
- Title: Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies
- Authors: Chunsan Hong, Seonho An, Min-Soo Kim, Jong Chul Ye
- Abstract summary: We cast denoising as a KL-regularized Markov decision process (MDP) with an explicit reference policy and optimize a regularized objective. We prove that the optimized policy under this framework generates samples that more closely match the data distribution than heuristic schedules.
- Score: 47.6755955972232
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked diffusion models (MDMs) have recently emerged as a novel framework for language modeling. MDMs generate sentences by iteratively denoising masked sequences, filling in [MASK] tokens step by step. Although MDMs support any-order sampling, performance is highly sensitive to the choice of which position to unmask next. Prior work typically relies on rule-based schedules (e.g., max-confidence, max-margin), which provide ad hoc improvements. In contrast, we replace these heuristics with a learned scheduler. Specifically, we cast denoising as a KL-regularized Markov decision process (MDP) with an explicit reference policy and optimize a regularized objective that admits policy improvement and convergence guarantees under standard assumptions. We prove that the optimized policy under this framework generates samples that more closely match the data distribution than heuristic schedules. Empirically, across four benchmarks, our learned policy consistently outperforms max-confidence: for example, on SUDOKU, where unmasking order is critical, it yields a 20.1% gain over random and an 11.2% gain over max-confidence.
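To make the baseline concrete, here is a minimal sketch of the max-confidence unmasking heuristic that the paper compares against: at each step, fill the masked position whose top-1 probability is highest. The toy fixed-probability "model", token names, and numbers below are illustrative assumptions, not the paper's setup.

```python
MASK = "[MASK]"

def max_confidence_unmask(seq, predict_probs):
    """One denoising step of the max-confidence heuristic: fill the
    masked position whose top-1 probability is highest."""
    probs = predict_probs(seq)  # per-position dicts: {token: probability}
    masked = [i for i, t in enumerate(seq) if t == MASK]
    # Rank masked positions by their best token's probability.
    pos = max(masked, key=lambda i: max(probs[i].values()))
    token = max(probs[pos], key=probs[pos].get)  # greedy fill; sampling also possible
    out = list(seq)
    out[pos] = token
    return out

# Toy "model" with fixed per-position distributions (hypothetical numbers);
# a real MDM would recompute these after every unmasking step.
TABLE = [
    {"a": 0.8, "b": 0.2},  # position 0
    {"a": 0.4, "b": 0.6},  # position 1: least confident, unmasked last
    {"a": 0.1, "b": 0.9},  # position 2: most confident, unmasked first
]
predict = lambda seq: TABLE

seq = [MASK, MASK, MASK]
order = []
while MASK in seq:
    before = list(seq)
    seq = max_confidence_unmask(seq, predict)
    order.append(next(i for i in range(3) if before[i] != seq[i]))
print(seq, order)  # ['a', 'b', 'b'] [2, 0, 1]
```

The paper's point is precisely that this hand-designed ranking rule can be replaced by a learned policy trained under the KL-regularized MDP objective.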
Related papers
- CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
arXiv Detail & Related papers (2026-02-04T00:12:30Z)
- Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model [74.99242687133408]
Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. We introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule.
arXiv Detail & Related papers (2025-12-25T12:06:04Z) - Learning Unmasking Policies for Diffusion Language Models [33.44995119635116]
Diffusion large language models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. In this work, we propose to train sampling procedures using reinforcement learning.
arXiv Detail & Related papers (2025-12-09T20:44:33Z) - Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models [51.12873073612084]
Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance depends on the inference-time order of unmasking. We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders. LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection.
arXiv Detail & Related papers (2025-11-04T02:37:37Z) - Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing [4.707859580472452]
Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation. They enable parallel token sampling, rather than sequential, left-to-right generation. We present PUNT, a model-agnostic sampler that reconciles this trade-off.
arXiv Detail & Related papers (2025-10-24T18:41:26Z) - MARS-Sep: Multimodal-Aligned Reinforced Sound Separation [72.85468563236005]
MARS-Sep is a reinforcement learning framework for sound separation. It learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate. Experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation.
arXiv Detail & Related papers (2025-10-12T09:05:28Z) - Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models [13.575063025878208]
Masked diffusion language models promise fast, non-autoregressive text generation. Existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel.
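The dilated-grouping idea above can be illustrated with a small strided partition; this is only a sketch of the grouping described in the abstract (the function name and parameters are hypothetical, not the paper's API):

```python
def dilated_groups(length, num_groups):
    """Partition positions 0..length-1 into strided (dilated) groups so
    that, for num_groups >= 2, positions unmasked together in one
    parallel step are never adjacent in the sequence."""
    return [list(range(g, length, num_groups)) for g in range(num_groups)]

groups = dilated_groups(8, 2)
print(groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
# No two positions within a group are neighbors, so unmasking a whole
# group in parallel avoids directly adjacent token interactions.
```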
arXiv Detail & Related papers (2025-06-23T18:49:23Z) - Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Certifiably Robust Policies for Uncertain Parametric Environments [57.2416302384766]
We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters. We learn and analyse IMDPs for a set of unknown sample environments induced by parameters. We show that our approach produces tight bounds on a policy's performance with high confidence.
arXiv Detail & Related papers (2024-08-06T10:48:15Z) - Off-Policy Evaluation in Markov Decision Processes under Weak Distributional Overlap [3.351714665243138]
We revisit the task of off-policy evaluation in Markov decision processes (MDPs) under a weaker notion of distributional overlap. We introduce a class of truncated doubly robust (TDR) estimators which we find to perform well in this setting. We find that, in our experiments, appropriate truncation plays a major role in enabling accurate off-policy evaluation when strong distributional overlap does not hold.
arXiv Detail & Related papers (2024-02-13T03:55:56Z) - Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems [1.747623282473278]
We introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in stochastic networks.
Specifically, when the stationary distribution of the MDP is parametrized by policy parameters, we can improve existing policy-gradient methods for average-reward estimation.
arXiv Detail & Related papers (2023-12-05T14:44:58Z) - First-order Policy Optimization for Robust Markov Decision Process [40.2022466644885]
We consider the problem of solving a robust Markov decision process (MDP).
The robust MDP involves a set of discounted, finite-state, finite-action-space MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
arXiv Detail & Related papers (2022-09-21T18:10:28Z)
- Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis--Hastings [57.133639209759615]
We interpret masked language models (MLMs) as energy-based sequence models and propose two energy parametrizations derivable from trained MLMs.
We develop a tractable sampling scheme based on the Metropolis-Hastings Monte Carlo algorithm.
We validate the effectiveness of the proposed parametrizations by exploring the quality of samples drawn from these energy-based models.
arXiv Detail & Related papers (2021-06-04T22:04:30Z)
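For readers unfamiliar with the sampler referenced in the last entry, here is a generic Metropolis-Hastings sketch over a discrete state space with a symmetric proposal; the toy energy function and helper names are illustrative assumptions, not the paper's MLM energy parametrization:

```python
import math
import random

def metropolis_hastings(energy, propose, x0, steps, rng):
    """Generic Metropolis-Hastings with a symmetric proposal:
    accept a move x -> y with probability min(1, exp(energy(x) - energy(y)))."""
    samples = []
    x, e = x0, energy(x0)
    for _ in range(steps):
        y = propose(x, rng)
        ey = energy(y)
        if rng.random() < math.exp(min(0.0, e - ey)):  # lower energy = more probable
            x, e = y, ey
        samples.append(x)
    return samples

# Toy energy over 4 states: state 0 has the lowest energy, hence the
# highest stationary probability p(0) = 1 / (1 + 3 * exp(-2)) ~ 0.71.
energy = lambda s: 0.0 if s == 0 else 2.0
propose = lambda s, rng: rng.randrange(4)  # symmetric uniform proposal

rng = random.Random(0)  # seeded for reproducibility
samples = metropolis_hastings(energy, propose, x0=3, steps=2000, rng=rng)
```

The cited paper applies this accept/reject machinery to energies derived from a trained MLM rather than a hand-written energy function.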
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.