Fugu-MT Paper Translation (Abstract): SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Paper overview: SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

arxiv url: http://arxiv.org/abs/2509.24006v1
Date: Sun, 28 Sep 2025 17:58:59 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.588015
Title: SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
Title (reference translation): SLA: Beyond Sparsity in Diffusion Transformers via Sparse-Linear Attention
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
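Abstract summary: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck. Attention weights can be separated into two parts: a small fraction of large weights with high rank, and the remaining weights with very low rank. The paper proposes SLA, a trainable attention method that fuses sparse and linear attention to accelerate diffusion models.
Reference score (site-computed interest level): 88.47701139980636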
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
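Abstract (reference translation): In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck because of the long sequence length and quadratic complexity. Attention weights can be separated into two parts: a small fraction of large weights with high rank, and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Accordingly, the paper proposes SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention in order to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights and O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation and are accelerated significantly without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality. An efficient GPU kernel implemented for SLA yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.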
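
The three-way split described in the abstract (exact attention for critical weights, linear attention for marginal weights, nothing for negligible ones) can be sketched in NumPy. This is an illustrative sketch only, not the authors' fused GPU kernel. Everything the abstract does not state is an assumption: the function name `sparse_linear_attention`, the block size, classifying block pairs by ranking mean-pooled query/key scores, the fractions `top_frac` and `skip_frac`, the softmax feature map `phi`, and simply adding the two partial outputs. Training and fine-tuning are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_linear_attention(q, k, v, block=4, top_frac=0.25, skip_frac=0.25):
    """q, k, v: (N, d) arrays, N divisible by `block`. Returns (output, labels)."""
    n, d = q.shape
    nb = n // block
    scale = 1.0 / np.sqrt(d)
    qb, kb, vb = (x.reshape(nb, block, d) for x in (q, k, v))

    # Assumption: estimate block importance from mean-pooled queries and keys.
    p = softmax(qb.mean(axis=1) @ kb.mean(axis=1).T * scale)  # (nb, nb)

    # Per query block: 1 = critical, 0 = marginal, -1 = negligible.
    n_crit = max(1, int(round(top_frac * nb)))
    n_skip = min(int(round(skip_frac * nb)), nb - n_crit)
    order = np.argsort(-p, axis=-1)
    rows = np.arange(nb)[:, None]
    labels = np.zeros((nb, nb), dtype=int)
    labels[rows, order[:, :n_crit]] = 1
    if n_skip > 0:
        labels[rows, order[:, nb - n_skip:]] = -1

    # Assumption: softmax over the feature axis as a positive feature map.
    phi = softmax
    # Per-key-block summaries, so marginal blocks cost O(d^2) each, not O(block^2).
    h = np.einsum("jbd,jbe->jde", phi(kb), vb)  # (nb, d, d)
    z = phi(kb).sum(axis=1)                     # (nb, d)

    out = np.zeros((n, d))
    for i in range(nb):
        crit = np.flatnonzero(labels[i] == 1)
        marg = np.flatnonzero(labels[i] == 0)
        # Critical blocks: exact softmax attention restricted to those keys.
        kc, vc = kb[crit].reshape(-1, d), vb[crit].reshape(-1, d)
        o_sparse = softmax(qb[i] @ kc.T * scale) @ vc
        # Marginal blocks: linear attention from the precomputed summaries.
        o_linear = np.zeros_like(o_sparse)
        if marg.size:
            fq = phi(qb[i])
            o_linear = (fq @ h[marg].sum(axis=0)) / (fq @ z[marg].sum(axis=0))[:, None]
        # Negligible blocks are never touched.
        out[i * block:(i + 1) * block] = o_sparse + o_linear
    return out, labels

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out, labels = sparse_linear_attention(q, k, v)
print(out.shape)
print(labels)
```

Setting `top_frac=1.0` and `skip_frac=0.0` marks every block as critical, in which case the sketch reduces to ordinary full softmax attention. That makes a convenient sanity check.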

Related papers