Related papers: Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

URL: http://arxiv.org/abs/2508.09192v1
Date: Fri, 08 Aug 2025 04:51:37 GMT
Title: Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Authors: Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng
Abstract summary: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference.
Score: 14.22753953706955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ inference speed than LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
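
As a rough illustration of capability (1) in the abstract, the sketch below builds a block-causal attention mask: positions attend bidirectionally within their own block and causally to all earlier blocks. Under such a mask, the keys and values of a finished block do not depend on later blocks, which is what makes a KV cache usable. This is a generic sketch of block-wise autoregressive attention, not code from the D2F repository; the function name `block_causal_mask` and its parameters are hypothetical.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Hypothetical helper for illustration only: a position sees every position
    in its own block and in all earlier blocks, and nothing in later blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [k // block_size <= q // block_size for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    # Print a 6-token mask with blocks of 2 tokens ("x" = may attend).
    for row in block_causal_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` the mask reduces to the ordinary causal mask of an AR model, and with `block_size >= seq_len` it becomes fully bidirectional, so the block size interpolates between the two regimes.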

Related papers

Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z)
Residual Context Diffusion Language Models [90.07635240595926]
Residual Context Diffusion (RCD) is a module that converts discarded token representations into contextual residuals and injects them back for the next denoising step. RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead.
arXiv Detail & Related papers (2026-01-30T13:16:32Z)
VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding [52.69880888587866]
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. We propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive modeling and decode tokens in parallel. Experiments show VidLaDA rivals state-of-the-art AR baselines and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy.
arXiv Detail & Related papers (2026-01-25T15:02:01Z)
d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation [31.922313594074925]
Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs. Current methods typically focus on only one side of the coin, targeting either efficiency or performance. We propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism.
arXiv Detail & Related papers (2026-01-12T14:25:36Z)
From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs [58.640039233470766]
We show that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. NBDiff-7B (Base and Instruct) could inherit the long-context modeling and reasoning capabilities, and achieve state-of-the-art performance.
arXiv Detail & Related papers (2025-12-07T10:28:21Z)
Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way [23.877854550033224]
Diffusion-based large language models (dLLMs) have exhibited substantial potential for parallel text generation. Current dLLMs suffer from fixed generation lengths, meaning the generation length has to be determined before decoding. We propose to train a diffusion LLM with native variable generation lengths, abbreviated as dLLM-Var.
arXiv Detail & Related papers (2025-10-28T16:32:43Z)
dInfer: An Efficient Inference Framework for Diffusion Language Models [54.80918957287927]
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs. We present dInfer, an efficient framework for dLLM inference.
arXiv Detail & Related papers (2025-10-09T16:19:42Z)
Fast-dLLM v2: Efficient Block-Diffusion LLM [64.38006546510337]
Fast-dLLM v2 is a block diffusion language model that adapts pretrained AR models into dLLMs for parallel text generation. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens).
arXiv Detail & Related papers (2025-09-30T14:40:18Z)
Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding [40.96405124314983]
Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs). Currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model's output distribution.
arXiv Detail & Related papers (2025-09-22T17:58:21Z)
DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation [68.19756761027351]
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models. We investigate their denoising processes and reinforcement learning methods. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework.
arXiv Detail & Related papers (2025-06-25T17:35:47Z)
Accelerating Diffusion LLMs via Adaptive Parallel Decoding [50.9948753314669]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality. We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. Our experiments demonstrate that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.