Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
- URL: http://arxiv.org/abs/2602.23225v2
- Date: Fri, 27 Feb 2026 02:41:06 GMT
- Title: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
- Authors: Pengxiang Li, Dilxat Muhtar, Tianlong Chen, Lu Yin, Shiwei Liu
- Abstract summary: Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs converge to left-to-right, autoregressive (AR)-like decoding dynamics. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding.
- Score: 48.59679063480356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
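The abstract does not spell out NAP's "parallel-forced decoding strategy". As a purely hypothetical illustration of the general idea (confidence-thresholded parallel unmasking with a forced minimum number of token commits per step, using dummy confidences and predictions in place of a real model), a minimal sketch might look like:

```python
import numpy as np

MASK = -1  # sentinel id for a still-masked position

def parallel_unmask_step(tokens, confidences, predictions,
                         k_min=2, threshold=0.9):
    """One decoding step: commit every masked position whose predicted
    token clears `threshold`, but always commit at least `k_min` tokens
    so the decoder cannot collapse into one-token-at-a-time (AR-like)
    behavior. All inputs are 1-D arrays of equal length."""
    tokens = tokens.copy()
    masked = np.flatnonzero(tokens == MASK)
    if masked.size == 0:
        return tokens  # nothing left to decode
    confident = masked[confidences[masked] >= threshold]
    if confident.size < k_min:
        # force parallelism: take the k_min most confident masked slots
        order = masked[np.argsort(confidences[masked])[::-1]]
        confident = order[:min(k_min, order.size)]
    tokens[confident] = predictions[confident]
    return tokens

# toy example: 6 masked slots, 3 of which clear the 0.9 threshold
seq = np.full(6, MASK)
conf = np.array([0.95, 0.20, 0.99, 0.50, 0.91, 0.30])
pred = np.array([10, 11, 12, 13, 14, 15])
out = parallel_unmask_step(seq, conf, pred)
print(out)  # positions 0, 2, and 4 are committed in a single step
```

The `k_min` floor is what makes the step "parallel-forced": even when no position is confident, multiple tokens are still updated at once, whereas a pure threshold rule would fall back to serial decoding.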
Related papers
- VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding [52.69880888587866]
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. We propose VidLaDA, a Diffusion Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive modeling and decode tokens in parallel. Experiments show VidLaDA rivals state-of-the-art AR baselines and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy.
arXiv Detail & Related papers (2026-01-25T15:02:01Z) - Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow [30.201913054064363]
Masked Diffusion Language Models promise parallel token generation and arbitrary-order decoding. We characterize MDLM behavior along two dimensions -- parallelism strength and generation order. We evaluate eight mainstream MDLMs on 58 benchmarks spanning knowledge, reasoning, and programming.
arXiv Detail & Related papers (2026-01-22T02:39:36Z) - d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation [31.922313594074925]
Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs. Current methods typically focus on only one side of the coin, targeting either efficiency or performance. We propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism.
arXiv Detail & Related papers (2026-01-12T14:25:36Z) - WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference [44.87788417755154]
We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention. We show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups.
arXiv Detail & Related papers (2025-12-28T01:25:48Z) - LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding [53.46134917935135]
We introduce Lookahead PArallel decoding (LoPA), a training-free, plug-and-play algorithm that identifies a superior Token Filling Order (TFO). LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline.
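LoPA's exact branch-confidence criterion is not detailed in this summary. As a hypothetical sketch of the general selection idea (score each candidate Token Filling Order by the model confidence at the positions it would fill, keep the best branch; the mean-confidence score here is a crude stand-in, not LoPA's actual rule):

```python
import numpy as np

def select_tfo(candidate_orders, confidences):
    """Score each candidate Token Filling Order (a list of masked
    positions a branch would fill next) by the mean model confidence
    at those positions, and return the highest-scoring order along
    with its score."""
    scores = [float(np.mean(confidences[list(order)]))
              for order in candidate_orders]
    best = int(np.argmax(scores))
    return candidate_orders[best], scores[best]

# toy example: three branches proposing different fill orders
conf = np.array([0.90, 0.40, 0.95, 0.60, 0.85])
orders = [[0, 2], [1, 3], [2, 4]]
best_order, best_score = select_tfo(orders, conf)
print(best_order, best_score)  # [0, 2] wins with mean confidence 0.925
```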
arXiv Detail & Related papers (2025-12-18T06:22:01Z) - Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models [54.81955614221652]
Parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning. Behavioral analyses in both simple and complex reasoning tasks show that dLLMs exhibit genuine parallelism only for directly decidable outputs. We propose several practical mitigations (parallel-oriented prompting, diffusion early stopping, and parallel scaling) to reduce PSC-induced ineffectiveness and inefficiency.
arXiv Detail & Related papers (2025-10-10T16:58:14Z) - ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs [31.387806058620683]
Diffusion LLMs have attracted growing interest for their potential to dramatically accelerate inference through parallel decoding. Existing works largely overlook these inherent challenges, and evaluations on standard benchmarks are not sufficient to capture the quality degradation caused by parallel decoding. We propose ParallelBench, the first benchmark specifically designed for dLLMs, featuring realistic tasks that are trivial for humans and autoregressive LLMs yet exceptionally challenging for dLLMs under parallel decoding. Our findings underscore the pressing need for innovative decoding methods that can overcome the current speed-quality trade-off.
arXiv Detail & Related papers (2025-10-06T12:41:31Z) - Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models [82.87985794856803]
Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks. Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture.
arXiv Detail & Related papers (2025-10-05T10:50:52Z) - ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs [34.477777651648914]
Large language models (LLMs) pose significant inference latency challenges due to their autoregressive decoding paradigm. We propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
arXiv Detail & Related papers (2025-08-12T12:35:55Z) - Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs [57.69190972274813]
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to autoregressive models. Existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs.
arXiv Detail & Related papers (2025-07-24T16:51:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.