Fugu-MT Paper Translation (Abstract): Faster Video Diffusion with Trainable Sparse Attention

Paper summary: Faster Video Diffusion with Trainable Sparse Attention

arxiv url: http://arxiv.org/abs/2505.13389v2
Date: Wed, 21 May 2025 15:36:51 GMT
Status: translation complete
System update date: 2025-05-22 13:19:52.3392
Title: Faster Video Diffusion with Trainable Sparse Attention
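Title (reference translation): Faster video diffusion with trainable sparse attention
Authors: Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang
Abstract summary: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference.
Reference score (independently computed attention score): 21.593548582058403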
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.
Abstract (reference translation): Scaling video diffusion transformers (DiTs) is limited by quadratic 3D attention, even though most of the attention mass is concentrated on a small number of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention in both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles. This yields a single differentiable kernel that trains end-to-end, needs no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We carry out large-scale ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and shortens end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and as a key enabler for further scaling of video diffusion models.
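
The coarse-then-fine idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' kernel: the function name `coarse_to_fine_attention`, the mean pooling, the top-k tile selection, the flat 1D tiles, and the `tile` and `top_k` parameters are all assumptions made for illustration. The sketch also applies the tile mask to a dense score matrix, so it shows the selection logic only and saves no compute.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_to_fine_attention(q, k, v, tile=4, top_k=2):
    """Hypothetical two-stage sparse attention for a single head.

    q, k, v: arrays of shape (n, d), with n divisible by `tile`.
    Returns the attention output (n, d) and the boolean tile mask (t, t).
    """
    n, d = q.shape
    assert n % tile == 0, "sequence length must be a multiple of the tile size"
    t = n // tile
    scale = 1.0 / np.sqrt(d)

    # Coarse stage (assumed mean pooling): one pooled vector per tile.
    q_tiles = q.reshape(t, tile, d).mean(axis=1)
    k_tiles = k.reshape(t, tile, d).mean(axis=1)
    tile_scores = (q_tiles @ k_tiles.T) * scale           # (t, t)

    # Keep the top_k highest-scoring key tiles for each query tile.
    keep = np.argsort(-tile_scores, axis=-1)[:, :top_k]   # (t, top_k)
    tile_mask = np.zeros((t, t), dtype=bool)
    np.put_along_axis(tile_mask, keep, True, axis=-1)

    # Fine stage: token-level attention restricted to the selected tiles.
    # A real kernel would skip unselected tiles instead of masking them.
    token_mask = np.repeat(np.repeat(tile_mask, tile, axis=0), tile, axis=1)
    scores = (q @ k.T) * scale
    scores = np.where(token_mask, scores, -np.inf)
    return softmax(scores) @ v, tile_mask

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out, mask = coarse_to_fine_attention(q, k, v, tile=4, top_k=2)
```

With `top_k` equal to the number of tiles, every tile is selected and the sketch reduces to ordinary full attention, which is a convenient sanity check.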

Related papers