SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
- URL: http://arxiv.org/abs/2509.24695v2
- Date: Mon, 13 Oct 2025 09:12:27 GMT
- Title: SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
- Authors: Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie
- Abstract summary: We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. Two core designs ensure our efficient, effective and long video generation. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models.
- Score: 116.17385614259574
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at remarkably fast speed, and is deployable on an RTX 5090 GPU. Two core designs ensure efficient, effective, and long video generation: (1) Linear DiT: we leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation that employs a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the generation of a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
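To make design (1) concrete, below is a minimal PyTorch sketch of linear attention. It illustrates the general technique rather than SANA-Video's actual kernel: the elu+1 feature map and the tensor shapes are assumptions for the example. The key point is that phi(Q) @ (phi(K)^T V) avoids materializing the N x N attention matrix, dropping the cost from O(N^2 d) to O(N d^2), which matters when N is the very large token count of a video.

```python
# Minimal sketch of linear attention -- an illustration of the general
# technique, not SANA-Video's exact kernel. The elu+1 feature map phi is a
# common choice and is an assumption here.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, heads, tokens, dim)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                   # phi: positive feature map
    kv = torch.einsum("bhnd,bhne->bhde", k, v)          # (dim x dim) summary, O(N d^2)
    z = k.sum(dim=2)                                    # normalizer, (batch, heads, dim)
    out = torch.einsum("bhnd,bhde->bhne", q, kv)
    denom = torch.einsum("bhnd,bhd->bhn", q, z).unsqueeze(-1) + eps
    return out / denom
```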
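Design (2) follows from the associativity of the same product: the contribution of all past tokens folds into a fixed-size state, so a block-wise autoregressive decoder can carry global context at constant memory. The sketch below extends the function above to process video blocks sequentially; it is again an illustration under the same assumptions, not the released implementation.

```python
# Block-wise linear attention with a constant-memory running state.
# `state`/`norm` summarize all previous blocks, standing in for a KV cache
# that would otherwise grow with video length. Shapes and the block interface
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def blockwise_linear_attention(blocks, eps=1e-6):
    """blocks: list of (q, k, v), each (batch, heads, block_len, dim).
    Each block attends to itself plus everything summarized in the state."""
    q0 = blocks[0][0]
    b, h, _, d = q0.shape
    state = q0.new_zeros(b, h, d, d)   # cumulative phi(k)^T v over past blocks
    norm = q0.new_zeros(b, h, d)       # cumulative phi(k), for normalization
    outputs = []
    for q, k, v in blocks:
        q, k = F.elu(q) + 1, F.elu(k) + 1
        state = state + torch.einsum("bhnd,bhne->bhde", k, v)
        norm = norm + k.sum(dim=2)
        out = torch.einsum("bhnd,bhde->bhne", q, state)  # attend to past + self
        denom = torch.einsum("bhnd,bhd->bhn", q, norm).unsqueeze(-1) + eps
        outputs.append(out / denom)
    return torch.cat(outputs, dim=2)
```

Per-block cost is O(n d^2) regardless of how many blocks preceded it, which is what permits minute-long generation at fixed memory; a vanilla softmax KV cache would instead grow linearly with the total token count.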
Related papers
- Helios: Real Real-Time Long Video Generation Model [33.34372252025333]
Helios is a 14B autoregressive diffusion model with a unified input representation that supports T2V, I2V, and V2V tasks. Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
arXiv Detail & Related papers (2026-03-04T18:45:21Z)
- DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder [55.26098043655325]
DC-VideoGen can be applied to any pre-trained video diffusion model. It can be adapted to a deep compression latent space with lightweight fine-tuning.
arXiv Detail & Related papers (2025-09-29T17:59:31Z)
- Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers [29.130090574300635]
Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their compute demands pose a major challenge for practical deployment. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target.
arXiv Detail & Related papers (2025-06-05T14:41:38Z)
- Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. We propose a novel distributed inference strategy, termed DualParal. Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z)
- SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device [61.42406720183769]
We propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds.
arXiv Detail & Related papers (2024-12-13T18:59:56Z)
- REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle to large-scale applications is the expensive training and inference cost. We argue that videos contain significantly more redundant information than images, allowing them to be encoded with very few motion latents. We design an image-conditioned VAE that projects videos into an extremely compressed latent space and decodes them based on content images.
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
- Video-Infinity: Distributed Long Video Generation [73.30145218077074]
Diffusion models have recently achieved remarkable results for video generation.
Our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation 100 times faster than prior methods.
arXiv Detail & Related papers (2024-06-24T01:56:12Z)