Fugu-MT Paper Translation (Abstract): Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

Paper abstract: Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

arxiv url: http://arxiv.org/abs/2508.10774v1
Date: Thu, 14 Aug 2025 15:58:59 GMT
Status: Translation complete
Last updated in system: 2025-08-15 22:24:48.393446
Title: Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
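Title (reference translation): Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
Authors: Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang
Abstract summary: We propose BLADE, a data-free joint training framework for efficient video generation. Our framework demonstrates remarkable efficiency gains across different scales. On models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup.
Reference score (independently computed attention level): 17.18501092926442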
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges -- training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that directly incorporates sparsity into the distillation process rather than treating it as a separate compression step, with fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Our code and model weights are publicly available at: http://ziplab.co/BLADE-Homepage/.
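Abstract (reference translation): Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. Step distillation and sparse attention mechanisms are both promising as independent acceleration strategies, but combining them effectively presents critical challenges. To overcome these limitations, we propose BLADE, which introduces (1) an Adaptive Block-Sparse Attention (ASA) mechanism that dynamically generates content-aware sparsity masks, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM). We validate BLADE on text-to-video models such as CogVideoX-5B and Wan2.1-1.3B. Our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, it delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE raises the score of CogVideoX-5B to 0.569 (from 0.534) and of Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Our code and model weights are publicly available at: http://ziplab.co/BLADE-Homepage/.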
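Illustrative sketch (not from the paper): the abstract describes block-sparse attention driven by content-aware sparsity masks but gives no algorithmic detail. The pure-Python example below shows one generic way such a mask could be formed, by mean-pooling queries and keys into blocks and keeping the highest-scoring key blocks for each query block. The pooling rule, the top-k selection, and the names `mean_pool_blocks`, `block_mask`, and `block_sparse_attention` are all assumptions made for illustration; this should not be read as BLADE's ASA mechanism.

```python
import math

def mean_pool_blocks(x, block):
    """Average each run of `block` consecutive token vectors into one vector."""
    pooled = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled

def block_mask(q, k, block, keep):
    """For every query block, mark the `keep` key blocks with the highest pooled score.

    Hypothetical selection rule (top-k on mean-pooled dot products); the paper's
    actual mask generation may differ.
    """
    q_blocks = mean_pool_blocks(q, block)
    k_blocks = mean_pool_blocks(k, block)
    mask = []
    for qv in q_blocks:
        scores = [sum(a * b for a, b in zip(qv, kv)) for kv in k_blocks]
        order = sorted(range(len(k_blocks)), key=lambda j: scores[j], reverse=True)
        top = set(order[:keep])
        mask.append([j in top for j in range(len(k_blocks))])
    return mask

def block_sparse_attention(q, k, v, block, keep):
    """Softmax attention where each query only attends to its selected key blocks."""
    mask = block_mask(q, k, block, keep)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qv in enumerate(q):
        # Token indices of keys whose block is kept for this query's block.
        allowed = [j for j in range(len(k)) if mask[i // block][j // block]]
        logits = [scale * sum(a * b for a, b in zip(qv, k[j])) for j in allowed]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        out.append([sum(w * v[j][d] for w, j in zip(weights, allowed)) / total
                    for d in range(len(v[0]))])
    return out
```

With `keep` equal to the number of key blocks the function reduces to ordinary dense attention; smaller values skip whole blocks, which is where the cost saving of block-sparse attention comes from. A real implementation would skip the masked blocks inside a fused GPU kernel rather than loop over tokens in Python.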

Related papers