Advancing Block Diffusion Language Models for Test-Time Scaling
- URL: http://arxiv.org/abs/2602.09555v2
- Date: Wed, 11 Feb 2026 03:38:52 GMT
- Title: Advancing Block Diffusion Language Models for Test-Time Scaling
- Authors: Yi Lu, Deyang Kong, Jianing Wang, Linsen Guo, Xue Wang, Qi Guo, Tao Gui, Xuanjing Huang, Wei Ye, Shikun Zhang, Wei Wang
- Abstract summary: We propose a unified framework for test-time scaling in BDLMs. We introduce adaptivity in both decoding and block-wise generation. We show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines.
- Score: 73.54022593833638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
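The abstract describes BACD only at a high level: denoising is adjusted dynamically from model confidence, within bounds that limit error accumulation. As a rough illustration of that general idea, and not the authors' algorithm, the sketch below picks which masked positions in a block to commit in one denoising step. The function name `bounded_confidence_step`, the mean-confidence threshold rule, and the `lower`/`upper` bounds are all hypothetical choices made for this example.

```python
from typing import List


def bounded_confidence_step(
    confidences: List[float],
    masked: List[bool],
    lower: float = 0.6,   # hypothetical floor on the commit threshold
    upper: float = 0.95,  # hypothetical ceiling on the commit threshold
) -> List[int]:
    """Return indices of masked positions to commit in one denoising step.

    Illustrative rule (an assumption, not taken from the paper):
    the threshold is the mean confidence over masked positions,
    clipped to [lower, upper].
    """
    candidates = [i for i, is_masked in enumerate(masked) if is_masked]
    if not candidates:
        return []

    mean_conf = sum(confidences[i] for i in candidates) / len(candidates)
    threshold = min(upper, max(lower, mean_conf))

    chosen = [i for i in candidates if confidences[i] >= threshold]
    if not chosen:
        # Guarantee progress: commit the single most confident position.
        chosen = [max(candidates, key=lambda i: confidences[i])]
    return chosen


if __name__ == "__main__":
    # Easy block: several confident positions are committed in parallel.
    print(bounded_confidence_step([0.99, 0.5, 0.97, 0.2], [True] * 4))
    # Hard block: nothing clears the floor, so only one position is committed.
    print(bounded_confidence_step([0.1, 0.3, 0.2], [True] * 3))
```

In this sketch, the clipping is what makes the adaptivity "bounded": the ceiling lets confident blocks commit many tokens per step, while the floor keeps low-confidence blocks from committing more than one token at a time.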
Related papers
- DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs [17.284485483927448]
Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation. The widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency. We propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block.
arXiv Detail & Related papers (2026-02-05T18:41:38Z)
- Causal Autoregressive Diffusion Language Model [70.7353007255797]
CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation.
arXiv Detail & Related papers (2026-01-29T17:38:29Z)
- Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows [33.361153168706444]
We propose Deferred Commitment Decoding (DCD) as a training-free decoding strategy. DCD maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. Experiments show that DCD improves generation accuracy by 1.39% with comparable time on average compared to fixed block-based diffusion methods, with the most significant improvement reaching 9.0%.
arXiv Detail & Related papers (2026-01-05T12:57:33Z)
- Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
- Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8$\times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
- AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size [7.442463267121892]
Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. We introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime.
arXiv Detail & Related papers (2025-09-30T15:53:56Z)
- ATTS: Asynchronous Test-Time Scaling via Conformal Prediction [112.54016379556073]
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework. We show that ATTS delivers up to 56.7x speedup in test-time scaling and a 4.14x throughput improvement.
arXiv Detail & Related papers (2025-09-18T16:55:09Z)
- R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances problem-solving ability of large language models. CoT incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
- Block-wise Adaptive Caching for Accelerating Diffusion Policy [10.641633189595302]
Block-wise Adaptive Caching (BAC) is a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves up to 3x inference speedup for free on robotic benchmarks.
arXiv Detail & Related papers (2025-06-16T13:14:58Z)
- Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models [4.078176555898098]
We introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections.
arXiv Detail & Related papers (2025-06-11T05:33:56Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.