SparseD: Sparse Attention for Diffusion Language Models
- URL: http://arxiv.org/abs/2509.24014v1
- Date: Sun, 28 Sep 2025 18:10:10 GMT
- Title: SparseD: Sparse Attention for Diffusion Language Models
- Authors: Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, Xinchao Wang
- Abstract summary: Diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs). Existing open-source DLMs suffer from high inference latency. We propose SparseD, a novel sparse attention method for DLMs.
- Score: 98.05780626106555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention's quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This prevents recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
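The schedule described in the abstract (full attention in the early denoising steps, a one-time computation of head-specific sparse patterns, then reuse of those patterns in every later step) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `attend`, `top_k_pattern`, and `run`, the top-k rule for selecting which keys to keep, the tiny tensor sizes, and the random stand-in inputs that are held fixed per head.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v, pattern=None):
    """Single-head attention over lists of vectors.

    pattern[i] lists the key indices query i may attend to; None means
    full attention. Returns (outputs, probs, query_key_pairs_computed).
    """
    n, d = len(q), len(q[0])
    outputs, probs, pairs = [], [], 0
    for i in range(n):
        keys = list(range(n)) if pattern is None else list(pattern[i])
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in keys]
        p = softmax(scores)
        pairs += len(keys)
        outputs.append([sum(p[t] * v[j][c] for t, j in enumerate(keys))
                        for c in range(len(v[0]))])
        probs.append(dict(zip(keys, p)))
    return outputs, probs, pairs


def top_k_pattern(probs, keep):
    """Assumed selection rule: keep the `keep` most-attended keys per query."""
    return [sorted(sorted(row, key=row.get, reverse=True)[:keep])
            for row in probs]


def run(num_steps=8, full_steps=2, num_heads=2, n=6, d=4, keep=3, seed=0):
    """Count query-key pairs per denoising step (requires full_steps >= 1)."""
    rng = random.Random(seed)

    def rand():
        return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

    # Stand-in Q/K/V per head, held fixed across steps for simplicity.
    heads = [(rand(), rand(), rand()) for _ in range(num_heads)]
    patterns = [None] * num_heads
    pairs_per_step = []
    for step in range(num_steps):
        pairs = 0
        for h, (q, k, v) in enumerate(heads):
            use_full = step < full_steps
            _, probs, used = attend(q, k, v, None if use_full else patterns[h])
            pairs += used
            if step == full_steps - 1:
                # One-time, per-head pattern computed from the last full step.
                patterns[h] = top_k_pattern(probs, keep)
        pairs_per_step.append(pairs)
    return patterns, pairs_per_step


patterns, pairs_per_step = run()
print(pairs_per_step)
```

With the default arguments, each full-attention step scores 6 x 6 pairs per head and each sparse step scores 6 x 3, so the per-step cost halves once the stored patterns take over. A real implementation would operate on blocks inside a fused attention kernel rather than on Python lists.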
Related papers
- EDIT: Early Diffusion Inference Termination for dLLMs Based on Dynamics of Training Gradients [6.736735746633275]
  Diffusion-based large language models (dLLMs) refine token generations through iterative denoising, but answers often stabilize before all steps complete. We propose EDIT, an inference-time criterion that adaptively stops denoising once sufficient reasoning stability relative to training-time reasoning is detected.
  (2025-11-29T23:47:47Z)
- LiteAttention: A Temporal Sparse Attention for Diffusion Transformers [1.3471268811218626]
  LiteAttention exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models.
  (2025-11-14T08:26:55Z)
- Diffusion Language Models Know the Answer Before Decoding [56.96815863705218]
  Diffusion language models (DLMs) have emerged as an alternative to autoregressive approaches. Our work highlights and leverages an overlooked property of DLMs: early answer convergence. We introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding.
  (2025-08-27T15:40:25Z)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
  Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
  (2025-08-04T16:14:03Z)
- Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
  Diffusion Large Language Models (DLLMs) offer efficient parallel generation and capable global modeling. The dominant application of DLLMs is hindered by the need for a statically predefined generation length. We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
  (2025-08-01T17:56:07Z)
- Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion [16.99620863197586]
  Diffusion language models offer parallel token generation and inherent bidirectionality. State-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. We introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking. For the first time, diffusion language models achieve latency comparable to, and even faster than, that of the widely adopted autoregressive models.
  (2025-05-27T17:39:39Z)
- Training-free and Adaptive Sparse Attention for Efficient Long Video Generation [31.615453637053793]
  Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency. We propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs.
  (2025-02-28T14:11:20Z)
- END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
  Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. They are often distracted by irrelevant or noisy context in input sequences that degrades output quality. We introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning the LLMs.
  (2025-02-26T08:07:17Z)
- SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [53.4441894198495]
  Large language models (LLMs) now support extremely long context windows. The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. We propose SampleAttention, an adaptive structured and near-lossless sparse attention.
  (2024-06-17T11:05:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.