Related papers: Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective

Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective

URL: http://arxiv.org/abs/2602.00286v1
Date: Fri, 30 Jan 2026 20:15:18 GMT
Title: Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective
Authors: Shaorong Zhang, Longxuan Yu, Rob Brekelmans, Luhan Tang, Salman Asif, Greg Ver Steeg,
Abstract summary: Masked Diffusion Models (MDMs) significantly accelerate inference by trading off sequential determinism.<n>We provide a unified information-theoretic framework to decouple and analyze two fundamental sources of failure: order sensitivity and parallelization bias.
Score: 16.942478643768144
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Masked Diffusion Models (MDMs) significantly accelerate inference by trading off sequential determinism. However, the theoretical mechanisms governing generation order and the risks inherent in parallelization remain under-explored. In this work, we provide a unified information-theoretic framework to decouple and analyze two fundamental sources of failure: order sensitivity and parallelization bias. Our analysis yields three key insights: (1) The benefits of Easy-First decoding (prioritizing low-entropy tokens) are magnified as model error increases; (2) factorized parallel decoding introduces intrinsic sampling errors that can lead to arbitrary large Reverse KL divergence, capturing "incoherence" failures that standard Forward KL metrics overlook; and (3) while verification can eliminate sampling error, it incurs an exponential cost governed by the total correlation within a block. Conversely, heuristics like remasking, though computationally efficient, cannot guarantee distributional correctness. Experiments on a controlled Block-HMM and large-scale MDMs (LLaDA) for arithmetic reasoning validate our theoretical framework.

Related papers

Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates.<n>We find that step-level recombination is most beneficial on harder problems.<n>Our training-free framework improves average accuracy by up to 2 across six math and coding tasks.
arXiv Detail & Related papers (2026-02-26T11:08:39Z)
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning [17.384089089363382]
We identify a root cause that existing methods overlook: the uniform penalization of errors.<n>Current approaches treat all incorrect rollouts within a group identically.<n>We propose the Asymmetric Confidence-aware Error Penalty (ACE)
arXiv Detail & Related papers (2026-02-24T22:46:43Z)
MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning [16.012761588513026]
Reinforcement Learning with Verifiable Rewards (RLVR) algorithms rely on rigid, uniform, and symmetric trust region mechanisms.<n>We propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions.<n> MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence.
arXiv Detail & Related papers (2026-02-19T17:05:20Z)
Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents.<n>We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(sqrtT)$.<n>Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
arXiv Detail & Related papers (2026-02-10T21:08:53Z)
Learning Causality for Longitudinal Data [1.2691047660244335]
This thesis develops methods for causal inference and causal representation learning in high-dimensional, time-varying data.<n>The first contribution introduces the Causal Dynamic Variational Autoencoder (CDVAE), a model for estimating Individual Treatment Effects (ITEs)<n>The second contribution proposes an efficient framework for long-term counterfactual regression based on RNNs enhanced with Contrastive Predictive Coding ( CPC) and InfoMax.<n>The third contribution advances CRL by addressing how latent causes manifest in observed variables.
arXiv Detail & Related papers (2025-12-04T16:51:49Z)
Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching [14.503330877000758]
Time-Conditioned Contraction Matching is a novel method for semi-supervised anomaly detection in tabular data.<n>It is inspired by flow matching, a recent generative modeling framework that learns velocity fields between probability distributions.<n>Extensive experiments on the ADBench benchmark show that TCCM strikes a favorable balance between detection accuracy and inference cost.
arXiv Detail & Related papers (2025-10-21T06:26:38Z)
Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models [54.81955614221652]
parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning.<n> Behavioral analyses in both simple and complex reasoning tasks show thatDLLMs exhibit genuine parallelism only for directly decidable outputs.<n>We propose several practical mitigations, parallel-oriented prompting, diffusion early stopping, and parallel scaling, to reduce PSC-induced ineffectiveness and inefficiencies.
arXiv Detail & Related papers (2025-10-10T16:58:14Z)
Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling.<n>We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
Theoretical Benefit and Limitation of Diffusion Language Model [47.579673047639126]
Diffusion language models have emerged as a promising approach for text generation.<n>We present a rigorous theoretical analysis of a widely used type of diffusion language model, the Masked Diffusion Model (MDM)<n>Our analysis establishes the first theoretical foundation for understanding the benefits and limitations of MDMs.
arXiv Detail & Related papers (2025-02-13T18:59:47Z)
Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning [53.25336975467293]
We present the first theoretical error decomposition analysis of methods such as perplexity and self-consistency.<n>Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function.<n>We propose Reasoning-Pruning Perplexity Consistency (RPC), which integrates perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths.
arXiv Detail & Related papers (2025-02-01T18:09:49Z)
Rectified Diffusion Guidance for Conditional Generation [94.83538269086613]
We revisit the theory behind CFG and rigorously confirm that the improper combination coefficients (textiti.e.) brings about expectation shift the generative distribution.<n>We show that our approach enjoys a textbftextitform solution given the strength.<n> Empirical evidence on real-world data demonstrate the compatibility of our design with existing state-of-the-art diffusion models.
arXiv Detail & Related papers (2024-10-24T13:41:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.