Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding
- URL: http://arxiv.org/abs/2602.23792v1
- Date: Fri, 27 Feb 2026 08:36:06 GMT
- Title: Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding
- Authors: Xiangzhong Luo, Yilin An, Zhicheng Yu, Weichen Liu, Xu Yang
- Abstract summary: Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks. We introduce an adaptive parallel decoding approach, namely DiCo, which features a three-phase divide-and-conquer paradigm. Extensive experiments demonstrate that DiCo can achieve significant inference speedups while maintaining competitive generation quality.
- Score: 6.755667885643806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks, establishing themselves as an alternative to autoregressive large language models (LLMs). Unlike autoregressive LLMs that generate one token per step based on all previous tokens, dLLMs theoretically enable parallel generation of multiple tokens at each decoding step. However, recent dLLMs still favor one-token-per-step generation in practice, as directly decoding multiple masked tokens often leads to degraded generation quality and stability. This reveals a substantial gap between the theoretical parallelism and practical performance of dLLMs. To bridge this gap, we introduce an adaptive parallel decoding approach, namely DiCo, which features a three-phase divide-and-conquer paradigm to unleash the inherent parallelism of dLLMs. During the Divide phase, DiCo first explores the input masked sequence and identifies masked tokens as seed tokens, which are then expanded to construct a set of local clusters. During the Conquer phase, DiCo performs parallel decoding across different local clusters constructed in the Divide phase. The divide-and-conquer process repeatedly alternates between the Divide and Conquer phases until convergence. During the Finalize phase, DiCo decodes the remaining few masked tokens using an effective fine-grained compound decoding scheme to finalize the generation. Extensive experiments demonstrate that DiCo can achieve significant inference speedups while maintaining competitive generation quality.
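The abstract describes the Divide, Conquer, and Finalize phases only at a high level. Below is a minimal, hypothetical sketch of one plausible reading of that loop: seed selection and cluster construction (Divide), parallel greedy decoding within clusters (Conquer), and position-by-position cleanup of the leftovers (Finalize). The model interface, the confidence threshold, the cluster radius, and all function names are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of a divide-and-conquer parallel decoding loop in the
# spirit of DiCo. The model is assumed to map a partially masked sequence to
# (token_probs, confidences): per-position token distributions and a scalar
# confidence per position. These interfaces are assumptions for illustration.
MASK = -1  # sentinel id for a masked position


def divide(confidences, masked_positions, radius=2, threshold=0.9):
    """Divide phase: pick confident masked positions as seeds, then expand
    each seed into a local cluster of nearby masked positions."""
    seeds = [p for p in masked_positions if confidences[p] >= threshold]
    clusters, taken = [], set()
    for s in sorted(seeds):
        members = [p for p in masked_positions
                   if abs(p - s) <= radius and p not in taken]
        if members:
            taken.update(members)
            clusters.append(members)
    return clusters


def conquer(tokens, token_probs, clusters):
    """Conquer phase: decode all clusters in the same step (greedy argmax)."""
    for members in clusters:
        for p in members:
            tokens[p] = max(token_probs[p], key=token_probs[p].get)
    return tokens


def dico_style_decode(tokens, model, max_rounds=64):
    """Alternate Divide and Conquer until no cluster can be formed, then
    Finalize the few remaining masked tokens one position at a time
    (a stand-in for the paper's fine-grained compound decoding scheme)."""
    for _ in range(max_rounds):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            return tokens
        token_probs, confidences = model(tokens)   # assumed model interface
        clusters = divide(confidences, masked)
        if not clusters:                           # converged: hand off
            break
        tokens = conquer(tokens, token_probs, clusters)
    # Finalize phase: decode remaining masked tokens by highest confidence.
    while any(t == MASK for t in tokens):
        token_probs, confidences = model(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        p = max(masked, key=lambda i: confidences[i])
        tokens[p] = max(token_probs[p], key=token_probs[p].get)
    return tokens
```

Under this reading, the intuition behind clustering is that nearby masked tokens are the most likely to depend on one another, so grouping them around confident seeds lets distant regions of the sequence be filled in parallel while each cluster's decoding stays local.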
Related papers
- Residual Context Diffusion Language Models [90.07635240595926]
Residual Context Diffusion (RCD) is a module that converts discarded token representations into contextual residuals and injects them back for the next denoising step. RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead.
arXiv Detail & Related papers (2026-01-30T13:16:32Z)
- Diffusion Language Models are Provably Optimal Parallel Samplers [15.981424915336001]
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models. We show that DLMs augmented with a chain-of-thought can simulate any parallel sampling algorithm using an optimal number of sequential steps.
arXiv Detail & Related papers (2025-12-31T18:03:05Z)
- WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference [44.87788417755154]
We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention. We show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups.
arXiv Detail & Related papers (2025-12-28T01:25:48Z)
- Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z)
- dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency. Existing open-source models still require a number of decoding steps close to the token length to ensure performance. We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z)
- Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding [21.609237262034636]
Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens. We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted (a minimal sketch of this filter-gated unmasking idea appears after this list).
arXiv Detail & Related papers (2025-09-29T17:59:54Z)
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models [92.37117312251755]
We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR). For multi-vision encoding, we discard redundant tokens within each encoder via a rank guided collaborative token assignment strategy. For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
arXiv Detail & Related papers (2025-07-28T13:50:53Z)
- AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism [17.858104076062897]
Large language models (LLMs) are increasingly used for long-content generation. We propose AdaDecode, which accelerates decoding without requiring auxiliary models or changes to the original model parameters. AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup.
arXiv Detail & Related papers (2025-06-04T08:32:30Z)
- Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
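As referenced in the Learning to Parallel (Learn2PD) entry above, the following is a minimal sketch of filter-gated parallel unmasking: a lightweight filter scores each masked position's current guess, and only positions the filter accepts are committed in that step. The filter and model interfaces, the acceptance threshold, and the function names are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of filter-gated parallel decoding in the spirit of
# Learn2PD: commit a token at a position only when a learned filter predicts
# that the current guess already matches the final output.
MASK = -1  # sentinel id for a masked position


def filter_gated_decode(tokens, model, filter_fn, accept=0.5, max_steps=256):
    """Iteratively unmask every position the filter accepts; if none are
    accepted, commit the single most confident position to keep progressing."""
    for _ in range(max_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        token_probs, confidences = model(tokens)          # assumed interface
        guesses = {p: max(token_probs[p], key=token_probs[p].get)
                   for p in masked}
        # filter_fn scores how likely a guess is to survive to the final output.
        accepted = [p for p in masked
                    if filter_fn(tokens, p, guesses[p]) >= accept]
        if not accepted:                                   # fallback step
            accepted = [max(masked, key=lambda i: confidences[i])]
        for p in accepted:                                 # parallel commit
            tokens[p] = guesses[p]
    return tokens
```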