BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
- URL: http://arxiv.org/abs/2508.10774v2
- Date: Mon, 29 Sep 2025 15:46:55 GMT
- Title: BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
- Authors: Youping Gu, Xiaolong Li, Yuhao Hu, Minqi Chen, Bohan Zhuang
- Abstract summary: We present BLADE, a data-free joint training framework for efficient video generation inference. We show remarkable efficiency gains across different scales. On models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup.
- Score: 27.57431718095974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges: training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism that dynamically generates content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm, built upon Trajectory Distribution Matching (TDM), that incorporates sparsity directly into the distillation process rather than treating it as a separate compression step, and that converges quickly. We validate BLADE on text-to-video models such as CogVideoX-5B and Wan2.1-1.3B, and our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement: on the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. The project page is available at http://ziplab.co/BLADE-Homepage/.
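The abstract describes ASA only at a high level. As a rough illustration of the general idea (score coarse blocks of the attention map, then keep only the salient ones), here is a minimal sketch in PyTorch; the mean-pooled block scoring and fixed top-k keep ratio are illustrative assumptions, not BLADE's published design.

```python
# A minimal, hypothetical sketch of content-aware block-sparse attention.
# Block size, mean-pooled importance scoring, and the top-k keep rule are
# illustrative assumptions, not BLADE's actual ASA mechanism.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.25):
    """q, k, v: (batch, heads, seq, dim); seq must be divisible by `block`."""
    B, H, S, D = q.shape
    nb = S // block
    # Coarse importance: attention scores between mean-pooled q/k blocks.
    qb = q.view(B, H, nb, block, D).mean(dim=3)               # (B, H, nb, D)
    kb = k.view(B, H, nb, block, D).mean(dim=3)
    scores = qb @ kb.transpose(-1, -2) / D ** 0.5             # (B, H, nb, nb)
    # Keep only the top-k key blocks per query block.
    k_keep = max(1, int(nb * keep_ratio))
    top = scores.topk(k_keep, dim=-1).indices
    block_mask = torch.zeros_like(scores).scatter_(-1, top, 1.0).bool()
    # Expand the block mask to token resolution and mask dense attention.
    # (A real sparse kernel would skip masked blocks instead of building S x S.)
    token_mask = block_mask.repeat_interleave(block, dim=-2)
    token_mask = token_mask.repeat_interleave(block, dim=-1)  # (B, H, S, S)
    attn = (q @ k.transpose(-1, -2)) / D ** 0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))
    return F.softmax(attn, dim=-1) @ v
```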
Related papers
- LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration [12.183601881545039]
Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers pose a significant challenge to their practical deployment. We propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training.
arXiv Detail & Related papers (2026-02-24T02:53:28Z)
- GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers [5.2424169748898555]
GalaxyDiT is a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. We achieve $1.87\times$ and $2.37\times$ speedups on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
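For context on the PSNR figure above: PSNR is $10\log_{10}(\mathrm{peak}^2/\mathrm{MSE})$ in decibels, so a 5 to 10 dB gap corresponds to roughly a 3x to 10x reduction in mean squared error. A minimal sketch, assuming frames are float tensors in [0, 1]:

```python
# PSNR in decibels; a minimal sketch assuming frames scaled to [0, 1].
import torch

def psnr_db(x: torch.Tensor, y: torch.Tensor, peak: float = 1.0) -> torch.Tensor:
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(peak ** 2 / mse)
```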
arXiv Detail & Related papers (2025-12-03T05:08:18Z)
- StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. Live online streaming, however, operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
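To make the SLO terms concrete, here is a hypothetical measurement sketch; `generate_frames` and the target frame rate are stand-ins for illustration, not the paper's API.

```python
# Hypothetical sketch measuring the streaming SLOs named above:
# time-to-first-frame (TTFF), per-frame deadline misses, and jitter.
import time

def measure_slos(generate_frames, fps_target=16.0):
    deadline = 1.0 / fps_target          # per-frame deadline in seconds
    start = time.perf_counter()
    ttff, misses, stamps = None, 0, []
    for _frame in generate_frames():
        now = time.perf_counter()
        if ttff is None:
            ttff = now - start           # time-to-first-frame
        elif now - stamps[-1] > deadline:
            misses += 1                  # frame arrived past its deadline
        stamps.append(now)
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    jitter = (max(gaps) - min(gaps)) if gaps else 0.0
    return ttff, misses, jitter
```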
arXiv Detail & Related papers (2025-11-10T18:51:28Z)
- Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency [60.74505433956616]
The continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. We propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer.
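A JVP (Jacobian-vector product) kernel matters here because continuous-time consistency losses differentiate the denoiser along the sampling trajectory, which is a forward-mode derivative. A toy example with `torch.func.jvp`; the small MLP is a stand-in for a diffusion transformer, and the paper's contribution is making this step efficient for FlashAttention-2 at scale.

```python
# Toy example of the forward-mode JVP that sCM-style training requires:
# out = f(x) and dout = (df/dx) @ v in a single forward pass.
import torch
from torch.func import jvp

denoiser = torch.nn.Sequential(          # stand-in for a diffusion transformer
    torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64)
)

x = torch.randn(8, 64)                   # noisy latents
v = torch.randn(8, 64)                   # tangent along the sampling trajectory

out, dout = jvp(lambda inp: denoiser(inp), (x,), (v,))
```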
arXiv Detail & Related papers (2025-10-09T16:45:30Z)
- POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models [18.761042377485367]
POSE (Phased One-Step Adversarial Equilibrium) is a distillation framework that reduces the sampling steps of large-scale video diffusion models. We show that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal coherence, and frame quality.
arXiv Detail & Related papers (2025-08-28T17:20:01Z)
- Seedance 1.0: Exploring the Boundaries of Video Generation Models [71.26796999246068]
Seedance 1.0 is a high-performance and inference-efficient video generation foundation model. It integrates multi-source data curation augmented with precise and meaningful video captioning. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (on an NVIDIA L20).
arXiv Detail & Related papers (2025-06-10T17:56:11Z)
- Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers [22.349130691342687]
Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation.
arXiv Detail & Related papers (2025-06-05T14:41:38Z)
- VORTA: Efficient Video Diffusion via Routing Sparse Attention [54.84294780326206]
VORTA is an acceleration framework with two novel components. It achieves a $1.76\times$ end-to-end speedup without loss of quality on VBench. It can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation.
arXiv Detail & Related papers (2025-05-24T17:46:47Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [52.32078428442281]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
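The core change behind such an adaptation is the attention mask: tokens keep full attention within their own frame but become causal across frames. A minimal illustrative sketch (frame and token counts are made up):

```python
# Illustrative block-causal mask: full attention within a frame,
# causal attention across frames (no token sees a future frame).
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Returns an (S, S) boolean mask; True means attention is allowed."""
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]

mask = block_causal_mask(num_frames=4, tokens_per_frame=3)
# Row i may attend to column j iff frame(i) >= frame(j), unlike the
# all-True mask of the bidirectional teacher.
```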
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation [52.56469577812338]
We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. We present a novel approach that transforms the original sequential denoising into a batching denoising process.
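A rough sketch of what batching the denoising process can mean: keep a pipeline of inputs at staggered noise levels and advance all of them together each tick, so one sample finishes per tick at steady state. `denoise_step` is a hypothetical stand-in for one scheduler step, not StreamDiffusion's API.

```python
# Hypothetical "stream batch" sketch: a pipeline of latents at staggered
# denoising steps, all advanced together each tick. A real implementation
# would stack the latents and call the model once per tick; we loop per
# latent here only for clarity.
from collections import deque

def stream_batch(inputs, denoise_step, num_steps):
    pipeline = deque()                              # (latent, step) pairs
    for latent in inputs:
        pipeline.append((latent, 0))                # admit a new input
        pipeline = deque((denoise_step(z, t), t + 1) for z, t in pipeline)
        while pipeline and pipeline[0][1] >= num_steps:
            yield pipeline.popleft()[0]             # fully denoised: emit
    while pipeline:                                 # drain after input ends
        pipeline = deque((denoise_step(z, t), t + 1) for z, t in pipeline)
        while pipeline and pipeline[0][1] >= num_steps:
            yield pipeline.popleft()[0]
```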
arXiv Detail & Related papers (2023-12-19T18:18:33Z)