Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- URL: http://arxiv.org/abs/2502.01776v1
- Date: Mon, 03 Feb 2025 19:29:16 GMT
- Title: Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
- Authors: Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han
- Abstract summary: Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability. We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
- Score: 59.80405282381126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy that captures the dynamic sparse patterns and predicts the type of each attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
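To make the head-classification idea concrete, below is a minimal NumPy sketch (not the authors' implementation) of how such online profiling could label a head: it probes a few query rows, compares full attention against a per-frame "spatial" mask and a same-position-across-frames "temporal" mask, and picks whichever pattern reproduces the full output more closely. The function names, the probe-row count, and the exact mask and sampling choices are illustrative assumptions; SVG's actual profiling, tensor layout transformation, and kernels differ.

```python
# Illustrative sketch of spatial/temporal head classification via online profiling.
# Not the authors' code; names and mask definitions are hypothetical.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # suppress masked-out tokens
    return softmax(scores) @ v

def classify_head(q, k, v, num_frames, tokens_per_frame, num_probe_rows=16, rng=None):
    """Return 'spatial' or 'temporal' for one attention head."""
    if rng is None:
        rng = np.random.default_rng(0)
    seq_len = num_frames * tokens_per_frame
    frame_id = np.arange(seq_len) // tokens_per_frame      # which frame each token belongs to
    pos_in_frame = np.arange(seq_len) % tokens_per_frame   # spatial position inside the frame

    # Spatial pattern: queries attend only to tokens in the same frame (block diagonal).
    spatial_mask = frame_id[:, None] == frame_id[None, :]
    # Temporal pattern: queries attend to the same spatial position across all frames.
    temporal_mask = pos_in_frame[:, None] == pos_in_frame[None, :]

    # Probe a small subset of query rows and compare against the full-attention reference.
    rows = rng.choice(seq_len, size=min(num_probe_rows, seq_len), replace=False)
    full = attention(q[rows], k, v)
    err_spatial = np.mean((attention(q[rows], k, v, spatial_mask[rows]) - full) ** 2)
    err_temporal = np.mean((attention(q[rows], k, v, temporal_mask[rows]) - full) ** 2)
    return "spatial" if err_spatial <= err_temporal else "temporal"

# Tiny usage example with random projections standing in for a real DiT head.
F, T, D = 4, 8, 16  # frames, tokens per frame, head dimension
q, k, v = (np.random.randn(F * T, D) for _ in range(3))
print(classify_head(q, k, v, num_frames=F, tokens_per_frame=T))
```

In a real pipeline, the selected pattern would then drive a sparse attention kernel for the remaining query rows of that head, which is where the reported end-to-end speedup would come from.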
Related papers
- GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting [17.17292309504131]
GaussVideoDreamer advances generative multimedia approaches by bridging the gap between image, video, and 3D generation.
Our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods.
arXiv Detail & Related papers (2025-04-14T09:04:01Z) - D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS [22.373386953378002]
Implicit Neural Representations (INRs) have emerged as a powerful approach for video representation, offering versatility across tasks such as compression and inpainting.
We propose a novel video representation based on deformable 2D Gaussian splatting, dubbed D2GV.
We demonstrate D2GV's versatility in tasks including video interpolation, inpainting, and denoising, underscoring its potential as a promising solution for video representation.
arXiv Detail & Related papers (2025-03-07T17:26:27Z) - DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training [85.04885553561164]
Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos.
Attention in DiTs can consume up to 95% of processing time and demands specialized context parallelism.
This paper introduces DSV to accelerate video DiT training by leveraging the dynamic attention sparsity we empirically observe.
arXiv Detail & Related papers (2025-02-11T14:39:59Z) - VidTwin: Video VAE with Decoupled Structure and Dynamics [24.51768013474122]
VidTwin is a video autoencoder that decouples video into two distinct latent spaces. Structure latent vectors capture overall content and global movement, and Dynamics latent vectors represent fine-grained details and rapid movements. Experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality.
arXiv Detail & Related papers (2024-12-23T17:16:58Z) - Representing Long Volumetric Video with Temporal Gaussian Hierarchy [80.51373034419379]
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. We propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. This work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.
arXiv Detail & Related papers (2024-12-12T18:59:34Z) - V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians [53.614560799043545]
V3 (Viewing Volumetric Videos) is a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians.
Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs.
As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience.
arXiv Detail & Related papers (2024-09-20T16:54:27Z) - CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with the text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z) - Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark: compared with prior arts, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, improving PSNR from 34.07 to 34.57.
arXiv Detail & Related papers (2022-10-13T08:15:08Z) - Real-time Online Video Detection with Temporal Smoothing Transformers [4.545986838009774]
A good streaming recognition model captures both long-term dynamics and short-term changes of video.
To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernels.
We build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead.
arXiv Detail & Related papers (2022-09-19T17:59:02Z) - Local-Global Context Aware Transformer for Language-Guided Video Segmentation [103.35509224722097]
We explore the task of language-guided video segmentation (LVS).
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset.
arXiv Detail & Related papers (2022-03-18T07:35:26Z) - DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition [140.66371549815034]
We propose a new transformer architecture, termed DualFormer, which can effectively and efficiently perform space-time attention for video recognition.
We show that DualFormer sets a new state of the art of 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with around 1000G inference FLOPs, at least 3.2 times fewer than existing methods with similar performance.
arXiv Detail & Related papers (2021-12-09T03:05:19Z) - Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG).
arXiv Detail & Related papers (2021-07-26T16:30:30Z) - Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture, that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z)