d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation
- URL: http://arxiv.org/abs/2601.07568v1
- Date: Mon, 12 Jan 2026 14:25:36 GMT
- Title: d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation
- Authors: Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, Hao Zhang
- Abstract summary: Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs. Current methods typically focus on only one side of the coin, targeting either efficiency or performance. We propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism.
- Score: 31.922313594074925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random-order generation. However, realizing these benefits in practice is non-trivial, as dLLMs inherently face an accuracy-parallelism trade-off. Despite increasing interest, existing methods typically focus on only one side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo-trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy-based multi-block decoding with a KV-cache refresh mechanism to achieve high parallelism while maintaining accuracy. To better evaluate dLLMs, we also introduce AUP (Accuracy Under Parallelism), a new metric that jointly measures accuracy and parallelism. Experiments demonstrate that our d3LLM achieves up to 10$\times$ speedup over vanilla LLaDA/Dream and 5$\times$ speedup over AR models without a significant accuracy drop. Our code is available at https://github.com/hao-ai-lab/d3LLM.
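The abstract only outlines the inference recipe, so the sketch below is a minimal, hypothetical illustration of entropy-based parallel unmasking: at each step, every masked position whose predictive entropy falls below a threshold is committed in parallel. The single-block simplification, the `logits_fn` interface, `MASK_ID`, the threshold `tau`, and the refresh period are all assumptions for illustration, not the authors' implementation (which also coordinates multiple blocks and a real KV cache).

```python
import torch

MASK_ID = -1  # sentinel outside the toy vocab; real dLLMs reserve a vocab slot

def entropy_parallel_decode(logits_fn, seq, max_steps=64, tau=0.5, refresh_every=4):
    """Commit all masked positions whose predictive entropy is below tau.

    logits_fn: callable mapping a (T,) token tensor to (T, V) logits.
    seq: (T,) tensor whose MASK_ID entries are still undecided.
    refresh_every: placeholder period; a real implementation would
        recompute stale cached keys/values here.
    """
    seq = seq.clone()
    for step in range(max_steps):
        masked = seq == MASK_ID
        if not masked.any():
            break
        probs = torch.softmax(logits_fn(seq), dim=-1)             # (T, V)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # (T,)
        confident = masked & (entropy < tau)
        if not confident.any():             # avoid stalling: always commit
            idx = torch.where(masked)[0]    # the single lowest-entropy token
            confident[idx[entropy[idx].argmin()]] = True
        seq[confident] = probs[confident].argmax(-1)
        if step % refresh_every == refresh_every - 1:
            pass  # KV-cache refresh point (omitted in this sketch)
    return seq

# toy usage: random logits stand in for a trained dLLM over 8 masked slots
torch.manual_seed(0)
print(entropy_parallel_decode(lambda s: torch.randn(s.numel(), 32),
                              torch.full((8,), MASK_ID, dtype=torch.long)))
```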
Related papers
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. We find that step-level recombination is most beneficial on harder problems. Our training-free framework improves average accuracy by up to 2 across six math and coding tasks.
arXiv Detail & Related papers (2026-02-26T11:08:39Z)
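The abstract names the mechanism (a pool of step-level candidates plus recombination) without the details, so here is a deliberately small sketch of the plain self-consistency baseline it builds on; `sample_fn` and the majority-vote rule are illustrative assumptions, and the paper's step-level splicing and reward-guided stitching are omitted.

```python
import random
from collections import Counter

def self_consistency_answer(sample_fn, n_samples=8):
    """Sample n reasoning traces and majority-vote their final answers.
    The paper goes further: it also recombines high-quality intermediate
    steps across traces, which needs a step scorer not shown here.

    sample_fn: callable returning (list_of_steps, final_answer).
    """
    answers = [sample_fn()[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# toy usage: a noisy solver that is right 60% of the time
random.seed(0)
noisy = lambda: (["step 1", "step 2"], "42" if random.random() < 0.6 else "41")
print(self_consistency_answer(noisy))  # the vote usually recovers "42"
```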
- Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z)
- Fast and Accurate Causal Parallel Decoding using Jacobi Forcing [41.89066334075016]
Jacobi Forcing is a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories. We introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup.
arXiv Detail & Related papers (2025-12-16T18:45:18Z)
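For readers unfamiliar with the Jacobi view of decoding, the sketch below shows the fixed-point iteration that Jacobi Forcing distills toward; the `logits_fn` interface, the draft initialization, and the stopping rule are assumptions, and the paper's rejection recycling is omitted. A fixed point of this iteration equals the greedy AR continuation, which is why accepted tokens cost no quality.

```python
import torch

def jacobi_decode(logits_fn, prefix, n_draft=8, max_iters=16, pad_id=0):
    """Jacobi-style parallel decoding for a causal AR model.

    Start from an arbitrary draft of n_draft future tokens and iterate:
    each pass greedily re-predicts every draft position in parallel from
    the current guess; a fixed point equals greedy AR decoding.

    logits_fn: callable mapping a (T,) token tensor to (T, V) causal
        logits, where logits[t] predicts token t + 1 (AR convention).
    """
    draft = torch.full((n_draft,), pad_id, dtype=prefix.dtype)
    for _ in range(max_iters):
        seq = torch.cat([prefix, draft])
        preds = logits_fn(seq).argmax(-1)      # next-token guess at each t
        new_draft = preds[len(prefix) - 1 : len(prefix) - 1 + n_draft]
        if torch.equal(new_draft, draft):      # fixed point reached
            break
        draft = new_draft
    return torch.cat([prefix, draft])

# toy usage: a fixed random lookup table stands in for a causal model
torch.manual_seed(0)
table = torch.randn(50, 50)
print(jacobi_decode(lambda s: table[s % 50], torch.tensor([3, 1, 4])))
```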
- Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way [23.877854550033224]
Diffusion-based large language models (dLLMs) have exhibited substantial potential for parallel text generation. Current dLLMs suffer from fixed generation lengths: the generation length must be chosen before decoding begins. We propose to train a diffusion LLM with native variable generation lengths, abbreviated as dLLM-Var.
arXiv Detail & Related papers (2025-10-28T16:32:43Z)
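The abstract does not describe dLLM-Var's training recipe, so the snippet below only illustrates the effect being targeted: letting a decoded [EOS] decide the effective length instead of a preset budget. The function and its interface are illustrative assumptions.

```python
def truncate_at_eos(tokens, eos_id):
    """Keep everything up to and including the first decoded [EOS]; the
    model, not a preset budget, decides the effective output length.
    Illustrative only: dLLM-Var trains this behavior in natively.
    """
    return tokens[: tokens.index(eos_id) + 1] if eos_id in tokens else tokens

print(truncate_at_eos([5, 9, 2, 7, 7], eos_id=2))  # -> [5, 9, 2]
```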
- dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency. However, existing open-source models still require a number of decoding steps close to the sequence length to maintain performance. We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z)
- Fast-dLLM v2: Efficient Block-Diffusion LLM [64.38006546510337]
Fast-dLLM v2 is a block diffusion language model that adapts pretrained AR models into dLLMs for parallel text generation. Its adaptation represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens).
arXiv Detail & Related papers (2025-09-30T14:40:18Z)
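As a sketch of what "block diffusion" means operationally, the loop below generates blocks left to right (AR across blocks, so KV caching applies) while denoising tokens inside each block in parallel. The `denoise_fn` interface, the mask sentinel, and the halving commit schedule are assumptions, not Fast-dLLM v2's actual design.

```python
import torch

MASK_ID = -1  # sentinel outside the toy vocab; real models reserve a slot

def block_diffusion_decode(denoise_fn, prompt, n_blocks=4, block_size=8,
                           steps_per_block=4):
    """Generate block by block: append a fully masked block, then run a few
    parallel denoising steps that commit half of the remaining masked
    positions each time (a simple uniform schedule).

    denoise_fn: callable mapping a (T,) token tensor to (T,) predicted
        token ids, standing in for the adapted AR model.
    """
    seq = prompt.clone()
    for _ in range(n_blocks):
        seq = torch.cat([seq, torch.full((block_size,), MASK_ID,
                                         dtype=seq.dtype)])
        for _ in range(steps_per_block):
            idx = torch.where(seq == MASK_ID)[0]
            if idx.numel() == 0:
                break
            k = max(1, idx.numel() // 2)  # commit 4, then 2, 1, 1 per block
            seq[idx[:k]] = denoise_fn(seq)[idx[:k]]
    return seq

# toy usage: a "denoiser" that just proposes position indices as tokens
print(block_diffusion_decode(lambda s: torch.arange(s.numel()),
                             torch.tensor([7, 7])))
```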
- Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing [14.22753953706955]
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation. This paper breaks the speed barrier relative to AR inference with a simple and effective strategy named discrete diffusion forcing (D2F). In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference.
arXiv Detail & Related papers (2025-08-08T04:51:37Z)
- Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
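The abstract states only that the parallel width is chosen dynamically; one plausible instantiation, sketched below under that assumption, is to accept the largest set of positions whose joint top-1 confidence stays above a threshold, so easy steps commit many tokens and hard steps fall back to one. APD's actual acceptance rule is more involved and is not reproduced here.

```python
import torch

def adaptive_accept_count(top1_probs, threshold=0.9):
    """Pick how many positions to commit this step: sort candidates by
    top-1 confidence and keep the largest k whose running product still
    exceeds `threshold` (always at least one, so decoding progresses).

    top1_probs: (N,) top-1 probabilities of the N candidate positions.
    Returns indices (into the candidate set) to commit in parallel.
    """
    conf, order = top1_probs.sort(descending=True)
    joint = conf.clamp_min(1e-9).log().cumsum(0).exp()  # running product
    k = max(int((joint > threshold).sum()), 1)
    return order[:k]

# toy usage: three confident positions pass; the 0.60 one waits a step
print(adaptive_accept_count(torch.tensor([0.99, 0.98, 0.60, 0.97])))
# -> tensor([0, 1, 3])
```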
- Large Language Diffusion Models [93.26422905620008]
Large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We introduce LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning paradigm. Across extensive benchmarks spanning general tasks, math, and code, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines.
arXiv Detail & Related papers (2025-02-14T08:23:51Z)
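For intuition about how such a model is trained, here is a compact sketch of the masked-diffusion objective family LLaDA belongs to: sample a masking ratio t, mask tokens independently with probability t, and score predictions only at masked positions with a 1/t weight. This follows the published masked-diffusion formulation in spirit; the exact loss, schedule, and batching in LLaDA's codebase may differ.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id):
    """Masked-diffusion LM objective (sketch): t ~ U(0,1) masking ratio,
    cross-entropy at masked positions only, weighted by 1/t as in the
    diffusion ELBO.

    model: callable mapping (B, T) token ids to (B, T, V) logits.
    """
    b, T = tokens.shape
    t = torch.rand(b, 1).clamp_min(1e-3)                 # per-sequence ratio
    masked = torch.rand(b, T) < t
    noised = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(noised)                               # (B, T, V)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    return ((ce * masked) / t).sum() / masked.sum().clamp(min=1)

# toy usage: random logits over a 100-token vocab, with id 99 as [MASK]
toy = lambda x: torch.randn(*x.shape, 100)
print(masked_diffusion_loss(toy, torch.randint(0, 99, (2, 16)), mask_id=99))
```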
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs. We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
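As background for the sparsity being exploited above, the snippet below shows one-shot magnitude pruning, the simplest way to produce an unstructured-sparse weight tensor. It is deliberately not the paper's method (which builds foundational sparse models via more sophisticated pruning plus continued pretraining); it only illustrates what "sparse" means at the tensor level.

```python
import torch

def magnitude_prune(weight, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of a weight tensor,
    producing unstructured sparsity that sparse kernels can exploit.
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    thresh = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > thresh)

# toy usage: half the entries of a random matrix become exact zeros
w = magnitude_prune(torch.randn(4, 4), sparsity=0.5)
print((w == 0).float().mean())  # ~0.5
```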