Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- URL: http://arxiv.org/abs/2502.01776v1
- Date: Mon, 03 Feb 2025 19:29:16 GMT
- Title: Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- Authors: Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han
- Abstract summary: Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability.
We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency.
SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
- Score: 59.80405282381126
- License:
- Abstract: Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups according to their distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy that captures the dynamic sparse patterns and predicts the type of each attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
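The abstract's two ingredients, head classification via online profiling and a layout transformation for temporal heads, can be pictured with a minimal PyTorch sketch. Everything below is an assumption-laden illustration rather than SVG's actual kernels: the function names (`classify_head`, `temporal_major`), the random sampling of query rows, and the specific same-frame and same-position masks are stand-ins chosen only to make the idea concrete.

```python
import torch

def classify_head(q, k, frame_len, num_frames, sample_rows=32, frame_window=1):
    """Toy online profiling for one attention head (illustrative, not SVG's kernel).

    q, k: (seq_len, head_dim) tensors with seq_len == num_frames * frame_len and
    tokens stored frame-major (all tokens of frame 0, then all of frame 1, ...).
    """
    seq_len, head_dim = q.shape
    rows = torch.randperm(seq_len)[:sample_rows]                   # profile only a few queries
    attn = torch.softmax(q[rows] @ k.T / head_dim ** 0.5, dim=-1)  # (sample_rows, seq_len)

    frame_id = torch.arange(seq_len) // frame_len   # frame index of every key token
    pos_id = torch.arange(seq_len) % frame_len      # spatial index within its frame
    q_frame = frame_id[rows].unsqueeze(1)
    q_pos = pos_id[rows].unsqueeze(1)

    # Spatial heads: attention mass concentrates on tokens in the same or nearby frames.
    spatial_mask = (frame_id.unsqueeze(0) - q_frame).abs() <= frame_window
    # Temporal heads: attention mass concentrates on the same spatial position across frames.
    temporal_mask = pos_id.unsqueeze(0) == q_pos

    spatial_mass = (attn * spatial_mask.float()).sum(-1).mean()
    temporal_mass = (attn * temporal_mask.float()).sum(-1).mean()
    return "spatial" if spatial_mass >= temporal_mass else "temporal"


def temporal_major(x, frame_len, num_frames):
    """Reorder frame-major tokens into position-major order so that the strided tokens a
    temporal head attends to become contiguous in memory (a simple stand-in for the
    paper's hardware-efficient tensor layout transformation)."""
    return x.view(num_frames, frame_len, -1).transpose(0, 1).reshape(num_frames * frame_len, -1)
```

In a real pipeline, such a check would run once per denoising step on a handful of query rows per head; spatial heads would then be dispatched to a within-frame sparse attention kernel and temporal heads to a strided kernel operating on the reordered layout.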
Related papers
- DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training [85.04885553561164]
Diffusion Transformers (DiTs) have shown remarkable performance in modeling and generating high-quality videos.
This paper introduces DSV, a novel framework designed to accelerate and scale the training of video DiTs.
arXiv Detail & Related papers (2025-02-11T14:39:59Z)
- Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile [28.913893318345384]
Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps.
This paper addresses the inefficiency issue from two aspects: 1) Prune the 3D full attention based on the redundancy within video data, and 2) Shorten the sampling process by adopting existing multi-step consistency distillation.
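The summary above names the two ingredients only at a high level. Purely as a hedged sketch of what tile-level pruning of 3D full attention could look like (the paper's actual redundancy criterion, its "attention tile" structure, and the consistency-distillation step are not reproduced here), one might keep only the frame-to-frame attention tiles that carry the most mass:

```python
import torch

def keep_frame_tiles(attn_probs, frame_len, keep_ratio=0.3):
    """Toy frame-to-frame tile pruning (illustrative only, not the paper's rule).

    attn_probs: (seq_len, seq_len) attention map averaged over a few heads/steps,
    with seq_len == num_frames * frame_len and frame-major token order.
    Returns a (num_frames, num_frames) boolean mask of tiles a block-sparse
    attention kernel would still compute.
    """
    seq_len = attn_probs.shape[0]
    num_frames = seq_len // frame_len
    # Total attention mass inside each frame-to-frame tile.
    tile_mass = attn_probs.view(num_frames, frame_len, num_frames, frame_len).sum(dim=(1, 3))
    k = max(num_frames, int(keep_ratio * num_frames * num_frames))  # keep at least one tile per frame
    threshold = tile_mass.flatten().topk(k).values.min()
    keep = tile_mass >= threshold
    keep |= torch.eye(num_frames, dtype=torch.bool)                 # never drop a frame's own tile
    return keep
```

A block-sparse attention kernel would then skip the pruned tiles entirely.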
arXiv Detail & Related papers (2025-02-10T05:00:56Z)
- VidTwin: Video VAE with Decoupled Structure and Dynamics [24.51768013474122]
VidTwin is a video autoencoder that decouples video into two distinct latent spaces.
Structure latent vectors capture overall content and global movement, and Dynamics latent vectors represent fine-grained details and rapid movements.
Experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality.
arXiv Detail & Related papers (2024-12-23T17:16:58Z)
- Representing Long Volumetric Video with Temporal Gaussian Hierarchy [80.51373034419379]
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos.
We propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos.
This work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.
arXiv Detail & Related papers (2024-12-12T18:59:34Z)
- V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians [53.614560799043545]
V3 (Viewing Volumetric Videos) is a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians.
Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs.
As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience.
arXiv Detail & Related papers (2024-09-20T16:54:27Z)
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with the text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z)
- Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior arts, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, improving PSNR from 34.07 to 34.57.
arXiv Detail & Related papers (2022-10-13T08:15:08Z)
- Real-time Online Video Detection with Temporal Smoothing Transformers [4.545986838009774]
A good streaming recognition model captures both long-term dynamics and short-term changes of video.
To capture both efficiently, we reformulate the cross-attention in a video transformer through the lens of kernels.
We build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead.
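As a loose toy illustration of how constant caching can work, one can replace a growing key/value cache with an exponentially decayed running summary; the class below is an assumption-only sketch, not TeSTra's kernel-based formulation (the name `ExpDecayStream` and the fixed `decay` factor are invented for illustration).

```python
import torch

class ExpDecayStream:
    """Toy streaming aggregator with O(1) state per step (illustrative only)."""

    def __init__(self, dim, decay=0.9):
        self.decay = decay
        self.weighted_values = torch.zeros(dim)  # running sum of weight * value
        self.total_weight = torch.zeros(())      # running sum of weights

    def update(self, score, value):
        # score: scalar relevance score for the newest frame feature `value` (shape: (dim,)).
        w = torch.exp(torch.as_tensor(score, dtype=torch.float32))
        self.weighted_values = self.decay * self.weighted_values + w * value
        self.total_weight = self.decay * self.total_weight + w

    def read(self):
        # Normalised summary of the exponentially down-weighted history.
        return self.weighted_values / (self.total_weight + 1e-6)
```

Each incoming frame updates two fixed-size accumulators, so memory and compute per step stay constant regardless of stream length.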
arXiv Detail & Related papers (2022-09-19T17:59:02Z)
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG).
arXiv Detail & Related papers (2021-07-26T16:30:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.