AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size
- URL: http://arxiv.org/abs/2509.26432v2
- Date: Wed, 01 Oct 2025 11:26:36 GMT
- Title: AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size
- Authors: Guanxi Lu, Hao Mark Chen, Yuto Karashima, Zhican Wang, Daichi Fujiki, Hongxiang Fan
- Abstract summary: Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. We introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime.
- Score: 7.442463267121892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, blockwise semi-autoregressive (semi-AR) approaches are widely adopted due to their natural support for KV caching and their favorable accuracy-speed trade-off. However, this paper identifies two fundamental limitations in the conventional semi-AR decoding approach that applies a fixed block size: i) late decoding overhead, where the unmasking of high-confidence tokens outside the current block is unnecessarily delayed, and ii) premature decoding error, where low-confidence tokens inside the current block are committed too early, leading to incorrect tokens. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. Through a statistical analysis of confidence dynamics during the denoising process, we identify a volatility band (VB) region during dLLM decoding, which encodes local semantic structure and can be used to guide adaptive block sizing. Leveraging these insights, we introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime. Extensive experiments across diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy improvement under the same throughput budget. Beyond inference-time optimization, we hope our semantics-aware adaptive scheduling approach and confidence-based analysis will inspire future training strategies for dLLMs.
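To make the adaptive block-sizing idea concrete, below is a minimal, hypothetical sketch of a confidence-driven block-boundary rule for semi-AR decoding. It is not the paper's algorithm: the drop-threshold heuristic, function names, and the commented model calls are illustrative assumptions, whereas AdaBlock-dLLM derives its boundaries from statistics of the volatility band rather than a fixed threshold.

```python
import torch


def adaptive_block_boundary(confidences: torch.Tensor,
                            min_block: int = 4,
                            max_block: int = 64,
                            drop_threshold: float = 0.5) -> int:
    """Pick the size of the next semi-AR decoding block.

    `confidences` holds the model's per-token confidence for the still-masked
    suffix (index 0 = first masked position). The block is extended while
    confidence stays high and cut where confidence falls below
    `drop_threshold`, approximating the goal of aligning block boundaries
    with semantic steps. The threshold rule is an illustrative stand-in for
    the paper's volatility-band analysis.
    """
    limit = min(max_block, confidences.numel())
    block = min(min_block, confidences.numel())
    for pos in range(block, limit):
        if confidences[pos].item() < drop_threshold:
            break  # confidence drops here: treat it as a block boundary
        block = pos + 1
    return block


# Hypothetical usage inside a semi-AR decoding loop (the model API is assumed):
# conf = model.token_confidences(x, masked_positions)   # shape [num_masked]
# block_size = adaptive_block_boundary(conf)
# x = model.denoise_block(x, masked_positions[:block_size])
```

Under this sketch, a wide high-confidence run yields a large block (avoiding the late-decoding overhead described above), while an early confidence drop shrinks the block (avoiding premature commitment of low-confidence tokens).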
Related papers
- Advancing Block Diffusion Language Models for Test-Time Scaling [73.54022593833638]
We propose a unified framework for test-time scaling in BDLMs. We introduce adaptivity in both decoding and block-wise generation. We show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines.
arXiv Detail & Related papers (2026-02-10T09:05:07Z) - DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs [17.284485483927448]
Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation. The widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency. We propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block.
arXiv Detail & Related papers (2026-02-05T18:41:38Z) - Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z) - Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows [33.361153168706444]
We propose Deferred Commitment Decoding (DCD) as a training-free decoding strategy. DCD maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. Experiments show that DCD improves generation accuracy by 1.39% with comparable decoding time on average relative to fixed block-based diffusion methods, with the largest improvement reaching 9.0%.
arXiv Detail & Related papers (2026-01-05T12:57:33Z) - From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs [58.640039233470766]
We show that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. NBDiff-7B (Base and Instruct) inherits the long-context modeling and reasoning capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-07T10:28:21Z) - From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models [19.97248408121574]
Diffusion Language Models (DLMs) offer comparable accuracy with faster inference speed via parallel decoding. High-confidence tokens carry negligible information, and strictly relying on them limits the effective progress made in each decoding round. We propose Explore-Then-Exploit (ETE), a training-free decoding strategy that maximizes information throughput and decoding efficiency.
arXiv Detail & Related papers (2025-11-26T06:38:37Z) - Fast-dLLM v2: Efficient Block-Diffusion LLM [64.38006546510337]
Fast-dLLM v2 is a block diffusion language model that adapts pretrained AR models into dLLMs for parallel text generation. This adaptation represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens).
arXiv Detail & Related papers (2025-09-30T14:40:18Z) - Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding [60.06816407728172]
Discrete diffusion language models have shown strong potential for text generation. Standard supervised fine-tuning misaligns with semi-autoregressive inference. We propose Blockwise SFT, which partitions responses into fixed-size blocks.
arXiv Detail & Related papers (2025-08-27T02:49:33Z) - Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. A minimal sketch of this kind of thresholded parallel unmasking appears after this list.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
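As referenced in the Fast-dLLM entry above, the following is a minimal sketch of confidence-thresholded parallel unmasking, the decoding-step pattern those works build on and the setting in which AdaBlock-dLLM adapts the block boundary. The threshold value, the single-token fallback rule, and the tensor layout are illustrative assumptions, not any published implementation.

```python
import torch


def parallel_unmask_step(logits: torch.Tensor,
                         masked: torch.Tensor,
                         tau: float = 0.9):
    """One confidence-thresholded unmasking step over the current block.

    `logits`: [block_len, vocab] model outputs for the block positions.
    `masked`: boolean tensor over the block (True = still masked).
    Tokens whose max softmax probability exceeds `tau` are committed in
    parallel; if none clears the threshold, only the single most confident
    masked token is committed so the loop still makes progress. Both the
    threshold and the fallback are illustrative choices.
    """
    if not masked.any():
        return None, torch.zeros_like(masked)  # nothing left to decode

    probs = logits.softmax(dim=-1)
    conf, tokens = probs.max(dim=-1)                 # per-position confidence and argmax token
    conf = conf.masked_fill(~masked, float("-inf"))  # ignore already-decoded slots

    commit = (conf > tau) & masked
    if not commit.any():
        commit = torch.zeros_like(masked)
        commit[conf.argmax()] = True                 # fallback: commit the single best token

    return tokens, commit  # caller writes tokens[commit] into the sequence
```

A semi-AR decoder would call this repeatedly on the current block until `masked` is empty, then advance to the next block; AdaBlock-dLLM's contribution is choosing where that next block ends.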