The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
- URL: http://arxiv.org/abs/2601.15165v2
- Date: Mon, 26 Jan 2026 08:29:32 GMT
- Title: The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
- Authors: Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang
- Abstract summary: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs.
- Score: 67.58848748317506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
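For context, the group-relative advantage at the core of GRPO is compact enough to sketch. The snippet below is a minimal illustration of that estimator only, not code from the paper; the function name and tensor shapes are assumptions.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only;
# not the JustGRPO implementation).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one group of sampled completions
    per prompt. Each completion's advantage is its reward normalized by the
    group's mean and standard deviation, replacing a learned value baseline."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled completions scored 0/1 by an answer checker:
print(grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]])))
# tensor([[ 0.8660, -0.8660, -0.8660,  0.8660]])
```

These advantages then weight a clipped, PPO-style importance-ratio objective over each completion's tokens; this is the standard machinery the paper applies once arbitrary order is forgone.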
Related papers
- Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning [62.680551162054975]
We introduce an end-to-end framework where LLMs learn to self-regulate the granularity of their reasoning steps through dynamic summarization. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows. Our Accordion-Thinker demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal token overhead.
arXiv Detail & Related papers (2026-02-03T08:34:20Z)
- Reinforcement Learning with Promising Tokens for Large Language Models [11.420715885411925]
Reinforcement learning (RL) has emerged as a key paradigm for aligning and optimizing large language models (LLMs). We introduce Reinforcement Learning with Promising Tokens (R), a framework that mitigates the action space issue by decoupling strategic decision-making from token generation.
arXiv Detail & Related papers (2026-02-03T07:08:06Z)
- Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLMs under Context-Free Grammars [17.13122301190815]
We present LAVE, a constrained decoding approach specifically designed for dLLMs. Our approach leverages a key property of dLLMs, namely their ability to predict token distributions for all positions in parallel during each forward pass. Extensive experiments across four widely used dLLMs and three representative benchmarks demonstrate that LAVE consistently outperforms existing baselines and achieves substantial improvements in syntactic correctness, while incurring negligible runtime overhead.
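As a toy illustration of the lookahead-then-verify idea described above: tokens proposed in one parallel pass are committed left to right only while a grammar check keeps passing. This is a hypothetical sketch, not the LAVE algorithm; a balanced-parentheses check stands in for a real context-free-grammar parser.

```python
# Toy lookahead-then-verify loop (hypothetical sketch, not the LAVE code).

def valid_paren_prefix(tokens: list[str]) -> bool:
    """Stand-in for a CFG parser: a prefix must never over-close parentheses."""
    depth = 0
    for t in tokens:
        depth += {"(": 1, ")": -1}.get(t, 0)
        if depth < 0:
            return False
    return True

def lookahead_then_verify(proposals: list[str], committed: list[str]) -> list[str]:
    for tok in proposals:          # lookahead: proposals came from one parallel pass
        if not valid_paren_prefix(committed + [tok]):
            break                  # reject; this and later slots stay masked
        committed.append(tok)      # verified against the grammar, so commit
    return committed

print(lookahead_then_verify(["(", "x", ")", ")", "("], []))  # ['(', 'x', ')']
```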
arXiv Detail & Related papers (2026-01-31T08:58:15Z)
- Latent-Space Contrastive Reinforcement Learning for Stable and Efficient LLM Reasoning [16.244366307890832]
We propose Deep Latent Reasoning (DLR), a latent-space bidirectional contrastive reinforcement learning framework. This framework shifts the trial-and-error cost from expensive token-level full-sequence generation to the continuous latent manifold. Experiments demonstrate that DLR achieves more stable training convergence, supports longer-horizon reasoning chains, and facilitates the sustainable accumulation of reasoning capabilities.
arXiv Detail & Related papers (2026-01-24T03:18:22Z)
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective [85.06838178922791]
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. We propose a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy.
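A single-sample Monte Carlo estimate of such a masked-diffusion ELBO is easy to write down. The sketch below is hedged: `model` is a hypothetical network mapping a partially masked sequence to per-position logits, and the weighting follows the standard masked-diffusion bound rather than anything quoted from the paper.

```python
# Hedged sketch: one Monte Carlo sample of a masked-diffusion ELBO, usable
# as a sequence-level log-likelihood proxy in RL objectives.
import torch
import torch.nn.functional as F

def elbo_sample(model, tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    """tokens: (batch, seq_len) ids; returns one lower-bound sample per sequence."""
    b, n = tokens.shape
    t = torch.rand(b, 1)                           # masking ratio per sequence
    masked = torch.rand(b, n) < t                  # positions to corrupt
    noisy = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                          # (batch, seq_len, vocab_size)
    logp = F.log_softmax(logits, -1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    # The 1/t weight makes the sum over masked positions an unbiased estimate
    # of the bound; averaging several samples reduces its variance.
    return (masked.float() * logp).sum(-1) / t.squeeze(-1)
```

Treating the whole sequence as one action, a scalar reward times the gradient of this proxy yields a REINFORCE-style sequence-level update.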
arXiv Detail & Related papers (2025-12-03T13:05:32Z)
- Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization [44.14678335188207]
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs). Reinforcement learning (RL) is a crucial component for dLLMs to achieve performance comparable to AR-LLMs on important tasks such as reasoning. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method.
arXiv Detail & Related papers (2025-10-09T13:59:50Z)
- RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance [101.30279597148973]
We propose reward-free guidance (RFG) for guiding the reasoning trajectory of dLLMs without explicit process reward. RFG consistently yields significant improvements across all tasks and model types, achieving accuracy gains of up to 9.2%.
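The summary does not spell out RFG's formula, so the snippet below only sketches the generic guidance recipe it evokes: contrast an enhanced model's denoising logits with a frozen reference model's, classifier-free-guidance style. Treat every name and the combination rule as an assumption.

```python
# Classifier-free-guidance-style logit combination: an illustrative guess at
# the general shape of reward-free guidance, not the RFG paper's formula.
import torch

def guided_logits(base_logits: torch.Tensor,
                  enhanced_logits: torch.Tensor,
                  w: float = 1.5) -> torch.Tensor:
    """Amplify the direction in which the enhanced model deviates from the
    frozen base model; w = 1 recovers the enhanced model, w > 1 extrapolates."""
    return base_logits + w * (enhanced_logits - base_logits)
```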
arXiv Detail & Related papers (2025-09-29T23:59:16Z)
- AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking [38.8730008545358]
Large language models (LLMs) often lack robustness in their reasoning. Our approach focuses on "abstracting" reasoning problems. We find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning.
arXiv Detail & Related papers (2025-06-09T13:34:50Z)
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks. We propose a novel approach for continuous-space reasoning that does not require modifying the LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)