Fugu-MT Paper Translation (Abstract): SparseD: Sparse Attention for Diffusion Language Models

Paper abstract: SparseD: Sparse Attention for Diffusion Language Models

arxiv url: http://arxiv.org/abs/2509.24014v1
Date: Sun, 28 Sep 2025 18:10:10 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.592201
Title: SparseD: Sparse Attention for Diffusion Language Models
Authors: Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, Xinchao Wang
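Abstract summary: Diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), but existing open-source DLMs suffer from high inference latency. The paper proposes SparseD, a novel sparse attention method for DLMs.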
Reference score (attention metric computed independently by this site): 98.05780626106555
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention's quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This prevents recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
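The schedule described in the abstract (full attention in the early denoising steps, then head-specific sparse patterns that are computed once and reused) can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not say how the sparse patterns are selected, so the per-query top-k rule below is an assumption, and the names `attend`, `topk_pattern`, `run_denoising`, `full_steps`, and `keep` are all hypothetical. The toy queries, keys, and values are also held fixed across steps, whereas in a real DLM they change at every step.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v, pattern=None):
    """Single-head attention. pattern[i] lists the key indices query i may use;
    pattern=None means full attention over every query-key pair."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        idx = range(len(k)) if pattern is None else pattern[i]
        weights = softmax([scale * dot(qi, k[j]) for j in idx])
        out.append([sum(w * v[j][d] for w, j in zip(weights, idx))
                    for d in range(len(v[0]))])
    return out


def topk_pattern(q, k, keep):
    """ASSUMED selection rule (not from the paper): each query keeps its
    `keep` highest-scoring keys."""
    pattern = []
    for qi in q:
        logits = [dot(qi, kj) for kj in k]
        best = sorted(range(len(k)), key=lambda j: logits[j], reverse=True)[:keep]
        pattern.append(sorted(best))
    return pattern


def run_denoising(heads, num_steps, full_steps, keep):
    """heads: one (q, k, v) triple per head. Returns the mode used at each
    step and the per-head patterns."""
    patterns = None
    modes = []
    for step in range(num_steps):
        if step < full_steps:
            # Early steps: full attention, since they matter most for quality.
            _ = [attend(q, k, v) for q, k, v in heads]
            modes.append("full")
        else:
            if patterns is None:
                # Computed exactly once, separately for every head, then reused.
                patterns = [topk_pattern(q, k, keep) for q, k, _ in heads]
            _ = [attend(q, k, v, p) for (q, k, v), p in zip(heads, patterns)]
            modes.append("sparse")
    return modes, patterns


def rand_matrix(n, d):
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
heads = [(rand_matrix(6, 4), rand_matrix(6, 4), rand_matrix(6, 4)) for _ in range(2)]
modes, patterns = run_denoising(heads, num_steps=5, full_steps=2, keep=3)
print(modes)  # ['full', 'full', 'sparse', 'sparse', 'sparse']
```

In this sketch each sparse step touches `keep` keys per query instead of all of them, and the pattern lookup cost is paid only once per head. Setting `keep` equal to the number of keys reproduces full attention, which is a convenient sanity check.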

