LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
- URL: http://arxiv.org/abs/2512.16229v2
- Date: Mon, 22 Dec 2025 13:29:11 GMT
- Title: LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
- Authors: Chenkai Xu, Yijie Jin, Jiajun Li, Yi Tu, Guoping Long, Dandan Tu, Mingcong Song, Hongjie Si, Tianqi Hou, Junchi Yan, Zhijie Deng
- Abstract summary: We introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior Token Filling Order (TFO). LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline.
- Score: 53.46134917935135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1--3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). We then introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.
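The abstract's core idea, spawning parallel branches with different candidate TFOs and keeping the branch whose confidences promise the most future parallelism, can be sketched as follows. This is an illustrative toy, not the authors' implementation: the scoring rule (mean confidence over remaining masked positions) and the function name `select_branch` are assumptions for exposition.

```python
import numpy as np

def select_branch(branch_confidences):
    """Score each candidate branch by the mean model confidence over its
    remaining masked positions, and keep the highest-scoring one.
    (Illustrative criterion; the paper's exact scoring may differ.)"""
    scores = [float(np.mean(conf)) for conf in branch_confidences]
    return int(np.argmax(scores))

# Each branch explored a distinct Token Filling Order; its per-token
# confidences over the still-masked positions determine its score.
branches = [
    np.array([0.9, 0.2, 0.3]),  # branch 0: one very confident token, rest weak
    np.array([0.7, 0.8, 0.6]),  # branch 1: uniformly confident
]
print(select_branch(branches))  # 1
```

A uniformly confident branch wins here because many positions can be unmasked in the same forward pass, which is exactly the "future parallelism" the branch confidence is meant to proxy.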
Related papers
- Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? [48.59679063480356]
Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs converge to left-to-right, autoregressive (AR)-like decoding dynamics. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding.
arXiv Detail & Related papers (2026-02-26T17:04:57Z)
- Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing [76.48164395646019]
Parallel-Probe is a training-free controller designed to optimize online parallel thinking. It reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
arXiv Detail & Related papers (2026-02-03T18:59:41Z) - d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation [31.922313594074925]
Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs.<n>Current methods typically focus on only one-side of the coin, targeting either efficiency or performance.<n>We propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism.
arXiv Detail & Related papers (2026-01-12T14:25:36Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the ''Three Taxes'' (Bulk Synchronous, Inter- Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.<n>We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.<n>We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency.<n>Existing open-source models still require nearly token-length decoding steps to ensure performance.<n>We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z) - Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding [21.609237262034636]
Autoregressive decoding in large language models (LLMs) requires $mathcalO(n)$ sequential steps for $n$ tokens.<n>We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output.<n>This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted.
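The oracle strategy the Learn2PD summary describes, unmasking exactly the positions whose current prediction already matches the final output, is simple enough to state directly in code. This is a minimal sketch of the oracle itself (which requires the final output and thus only bounds what the learned filter can achieve); the function name `oracle_unmask` is ours, not the paper's.

```python
def oracle_unmask(current_preds, final_output, masked_positions):
    """Oracle filter: return the masked positions whose current token
    prediction already equals the final decoded output, i.e. the tokens
    that are safe to commit in parallel this step."""
    return [i for i in masked_positions
            if current_preds[i] == final_output[i]]

preds  = ["The", "cat", "sat", "on", "a"]
final  = ["The", "cat", "sat", "on", "the"]
masked = [2, 3, 4]
print(oracle_unmask(preds, final, masked))  # [2, 3]
```

Position 4 stays masked because its prediction ("a") disagrees with the final output ("the"); a trained filter approximates this accept/reject decision without access to the final output.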
arXiv Detail & Related papers (2025-09-29T17:59:54Z) - Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing [14.22753953706955]
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation.<n>This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F)<n>In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference.
arXiv Detail & Related papers (2025-08-08T04:51:37Z) - FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference [9.279335822985441]
Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge.<n>Recent pipeline-based approaches have the potential to parallelize communication and computation, which helps reduce inference latency.<n>We propose FlowSpec, a pipeline-parallel tree-based speculative decoding framework.
arXiv Detail & Related papers (2025-07-03T13:47:42Z) - Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel.<n>APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z) - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.