Fugu-MT Paper Translation (Summary): FG-Attn: Leveraging Fine-Grained Sparsity In Diffusion Transformers

Paper summary: FG-Attn: Leveraging Fine-Grained Sparsity In Diffusion Transformers

arxiv url: http://arxiv.org/abs/2509.16518v1
Date: Sat, 20 Sep 2025 03:48:32 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:15.832531
Title: FG-Attn: Leveraging Fine-Grained Sparsity In Diffusion Transformers
Title (reference translation): FG-Attn: Leveraging Fine-Grained Sparsity in Diffusion Transformers
Authors: Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar
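Abstract summary: We propose FG-Attn, a sparse attention mechanism for long-context diffusion transformers. The method skips computation at the granularity of Mx1 slices of the attention map. It achieves an average 1.55X speedup for 5-second, 480p videos and an average 1.41X speedup for 5-second, 720p videos on a single H100 GPU.
Reference score (independently computed interest level): 6.260564859775371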
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Generating realistic videos with diffusion transformers demands significant computation, with attention layers the central bottleneck; even producing a short clip requires running a transformer over a very long sequence of embeddings, e.g., more than 30K embeddings for a 5-second video, incurring significant latency. Prior work aims to mitigate this bottleneck by exploiting sparsity in the attention layers to reduce computation. However, these works typically rely on block-sparse attention, which skips score computation only when all entries in a block of attention scores (corresponding to M queries and M keys, with M = 64 typically) are zero. This coarse-granular skipping of attention scores does not fully exploit sparsity in the attention map and leaves room for improvement. In this work, we propose FG-Attn, a sparse attention mechanism for long-context diffusion transformers that leverages sparsity at a fine granularity. Unlike block-sparse attention, which skips entire MxM blocks, our approach skips computations at the granularity of Mx1 slices of the attention map. Each slice is produced by query-key dot products between a block of query vectors and a single key. To implement our proposed sparse attention mechanism, we develop a new efficient bulk-load operation called asynchronous-gather load. This load operation gathers a sparse set of relevant key-value vectors from memory and arranges them into packed tiles in the GPU's shared memory. Only a sparse set of keys relevant to those queries are loaded into shared memory when computing attention for a block of queries, in contrast to loading full blocks of key tokens in block-sparse attention. Our fine-grained sparse attention, applied to video diffusion models, achieves an average 1.55X (up to 1.65X) speedup for 5 second, 480p videos, and an average 1.41X (up to 1.49X) for 5 second, 720p videos on a single H100 GPU.
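Abstract (reference translation): Generating realistic videos with diffusion transformers requires substantial computation, and the attention layers are the central bottleneck. Even a short clip requires running the transformer over a very long sequence of embeddings, for example more than 30K embeddings for a 5-second video, which incurs significant latency. Prior work has aimed to ease this bottleneck by exploiting sparsity in the attention layers to reduce computation. However, that work generally relies on block-sparse attention, which skips score computation only when every entry in a block of attention scores (corresponding to M queries and M keys, typically with M = 64) is zero. Such coarse-grained skipping does not fully exploit the sparsity of the attention map and leaves room for improvement. In this work, we propose FG-Attn, a sparse attention mechanism for long-context diffusion transformers that exploits sparsity at a fine granularity. Unlike block-sparse attention, which skips entire MxM blocks, our method skips computation at the granularity of Mx1 slices of the attention map. Each slice is produced by the query-key dot products between a block of query vectors and a single key. To implement the proposed mechanism, we develop a new efficient bulk-load operation called asynchronous-gather load. This load operation gathers a sparse set of relevant key-value vectors from memory and arranges them into packed tiles in the GPU's shared memory. When computing attention for a block of queries, only the sparse set of keys relevant to those queries is loaded into shared memory, in contrast to block-sparse attention, which loads full blocks of key tokens. Applied to video diffusion models, our fine-grained sparse attention achieves an average 1.55X (up to 1.65X) speedup for 5-second, 480p videos and an average 1.41X (up to 1.49X) speedup for 5-second, 720p videos on a single H100 GPU.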
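The difference between skipping MxM blocks and skipping Mx1 slices can be illustrated with a small counting exercise. The following is a minimal pure-Python sketch, not the paper's GPU kernel: the function names, the toy 0/1 mask, and the choice of M = 2 are all assumptions made for illustration, and the abstract does not say how the non-zero entries are identified. It counts how many attention-score entries each scheme would still compute, and lists the key indices that a gather step would pack together for each block of queries.

```python
# Illustrative sketch only. TOY_MASK, M = 2, and all function names are
# assumptions for this example; they do not come from the paper.

# mask[q][k] == 1 means the attention score for (query q, key k) is non-zero.
TOY_MASK = [
    [1, 0, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
]


def block_sparse_work(mask, m):
    """Count score entries computed when only all-zero m x m blocks are skipped."""
    n_q, n_k = len(mask), len(mask[0])
    work = 0
    for q0 in range(0, n_q, m):
        q1 = min(q0 + m, n_q)
        for k0 in range(0, n_k, m):
            k1 = min(k0 + m, n_k)
            block_nonzero = any(
                mask[q][k] for q in range(q0, q1) for k in range(k0, k1)
            )
            if block_nonzero:
                # The whole block is computed, including its zero entries.
                work += (q1 - q0) * (k1 - k0)
    return work


def gather_keys_per_query_block(mask, m):
    """For each block of m queries, list keys whose m x 1 slice is non-zero."""
    n_q, n_k = len(mask), len(mask[0])
    gathered = []
    for q0 in range(0, n_q, m):
        rows = range(q0, min(q0 + m, n_q))
        keys = [k for k in range(n_k) if any(mask[q][k] for q in rows)]
        gathered.append(keys)
    return gathered


def slice_sparse_work(mask, m):
    """Count score entries computed when all-zero m x 1 slices are skipped."""
    n_q = len(mask)
    work = 0
    for i, keys in enumerate(gather_keys_per_query_block(mask, m)):
        rows_in_block = min((i + 1) * m, n_q) - i * m
        work += rows_in_block * len(keys)
    return work


if __name__ == "__main__":
    m = 2
    print("dense entries:       ", len(TOY_MASK) * len(TOY_MASK[0]))
    print("block-sparse entries:", block_sparse_work(TOY_MASK, m))
    print("slice-sparse entries:", slice_sparse_work(TOY_MASK, m))
    print("gathered key indices:", gather_keys_per_query_block(TOY_MASK, m))
```

In this toy mask every non-zero block contains exactly one non-zero key column, so slice-level skipping halves the work relative to block-level skipping. The gathered key lists stand in for the packed tiles that the abstract says the asynchronous-gather load arranges in shared memory. Real attention maps will show different ratios.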
