Related papers: Improving Sampling for Masked Diffusion Models via Information Gain

Improving Sampling for Masked Diffusion Models via Information Gain

URL: http://arxiv.org/abs/2602.18176v1
Date: Fri, 20 Feb 2026 12:26:03 GMT
Title: Improving Sampling for Masked Diffusion Models via Information Gain
Authors: Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb,
Abstract summary: Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models.<n> Existing samplers typically adopt greedys, prioritizing positions with the highest local certainty to decode at each step.<n>We propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain.
Score: 9.059619122219502
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non-causal nature of MDMs, which enables evaluating how a decoding decision reshapes token probabilities/uncertainty across all remaining masked positions. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win-rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at https://github.com/yks23/Information-Gain-Sampler.

Related papers

Learn from Your Mistakes: Self-Correcting Masked Diffusion Models [31.536464269884103]
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models.<n>We propose a framework that trains a model to perform both unmasking and correction.<n>We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence.
arXiv Detail & Related papers (2026-02-12T05:17:31Z)
Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty [16.454646094266703]
Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge.<n>We are the first to formalize this issue, attributing variability in output quality to the cumulative predictive uncertainty along a generative path.<n>Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.
arXiv Detail & Related papers (2025-12-24T18:59:51Z)
Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models [51.12873073612084]
Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance depends on the inference time order of unmasking.<n>We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders.<n>LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection.
arXiv Detail & Related papers (2025-11-04T02:37:37Z)
PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models [33.98279129315148]
Masked diffusion models (MDMs) are powerful non-autoregressive alternatives for sequence generation.<n>In this work, we introduce Position-Aware Confidence-Calibrated Sampling (PC-Sampler), a novel decoding strategy.<n>PC-Sampler consistently outperforms existing MDM decoding strategies by more than 10% on average.
arXiv Detail & Related papers (2025-08-18T15:38:37Z)
Towards Better Code Generation: Adaptive Decoding with Uncertainty Guidance [42.737012213197865]
AdaDec is an adaptive decoding framework that employs a lookahead-based, uncertainty-aware pause-and-rerank mechanism.<n>AdaDec achieves up to 20.9% absolute gains in Pass@1 accuracy compared with greedy decoding.<n>By applying reranking only when necessary, AdaDec reduces computational overhead and latency, enhancing efficiency alongside reliability.
arXiv Detail & Related papers (2025-06-10T16:49:46Z)
LR-IAD:Mask-Free Industrial Anomaly Detection with Logical Reasoning [1.3124513975412255]
Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects.<n>Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations.<n>We propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance.
arXiv Detail & Related papers (2025-04-28T06:52:35Z)
Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results. We develop a method that avoids introducing external resources, relying instead on perturbations to the input. Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
arXiv Detail & Related papers (2024-03-04T16:21:54Z)
Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation [49.827306773992376]
Continual Test-Time Adaptation (CTTA) is proposed to migrate a source pre-trained model to continually changing target distributions. Our proposed method attains state-of-the-art performance in both classification and segmentation CTTA tasks.
arXiv Detail & Related papers (2023-12-19T15:34:52Z)
Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions. We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can be rather from some biases in data acquisition. We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training. We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
MAPS: A Noise-Robust Progressive Learning Approach for Source-Free Domain Adaptive Keypoint Detection [76.97324120775475]
Cross-domain keypoint detection methods always require accessing the source data during adaptation. This paper considers source-free domain adaptive keypoint detection, where only the well-trained source model is provided to the target domain.
arXiv Detail & Related papers (2023-02-09T12:06:08Z)
GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds [72.60362979456035]
Masked Autoencoders (MAE) are challenging to explore in large-scale 3D point clouds. We propose a textbfGenerative textbfDecoder for MAE (GD-MAE) to automatically merges the surrounding context. We demonstrate the efficacy of the proposed method on several large-scale benchmarks: KITTI, and ONCE.
arXiv Detail & Related papers (2022-12-06T14:32:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.