Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis
- URL: http://arxiv.org/abs/2504.14470v1
- Date: Sun, 20 Apr 2025 03:30:59 GMT
- Title: Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis
- Authors: Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, Lei Zhu
- Abstract summary: Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. Turbo2K is an efficient framework for generating detail-rich 2K videos.
- Score: 50.77548592888096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint and making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and limited model size impose constraints on generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level features at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency. Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced computational cost. Compared to existing methods, Turbo2K is up to 20$\times$ faster in inference, making high-resolution video generation more scalable and practical for real-world applications.
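The distillation idea in the abstract, aligning a smaller student DiT's internal representations with a larger teacher's despite mismatched latent spaces, can be pictured with a short sketch. The module below is a minimal illustration, not Turbo2K's released code; the linear projection, the MSE objective, and the tensor shapes are assumptions made for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """Align a student DiT block's hidden states with a teacher's.

    A learned linear projection bridges the two representation spaces,
    since the models use different latent spaces and widths.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat: torch.Tensor,
                teacher_feat: torch.Tensor) -> torch.Tensor:
        # student_feat: (B, N_s, D_s); teacher_feat: (B, N_t, D_t).
        projected = self.proj(student_feat)
        # If the token grids differ in length, resample along tokens.
        if projected.shape[1] != teacher_feat.shape[1]:
            projected = F.interpolate(
                projected.transpose(1, 2),   # (B, D_t, N_s)
                size=teacher_feat.shape[1],
                mode="linear",
                align_corners=False,
            ).transpose(1, 2)                # (B, N_t, D_t)
        # The teacher is frozen; only the student and projection learn.
        return F.mse_loss(projected, teacher_feat.detach())
```

Such an auxiliary alignment term would typically be added to the ordinary diffusion training objective with a weighting coefficient.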
Related papers
- H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models [76.1519545010611]
Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation. In this work, we examine the architecture design choices and optimize the computation distribution to obtain efficient and high-compression video AEs. Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics.
arXiv Detail & Related papers (2025-04-14T17:59:06Z)
- REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling. Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions. We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z)
- CascadeV: An Implementation of Wurstchen Architecture for Video Generation [4.086317089863318]
We propose a cascaded latent diffusion model (LDM) that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. Our model can be cascaded with existing T2V models, theoretically enabling a 4$\times$ increase in resolution or frames per second without any fine-tuning.
arXiv Detail & Related papers (2025-01-28T01:14:24Z)
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models [89.79067761383855]
Vchitect-2.0 is a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames. To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework.
arXiv Detail & Related papers (2025-01-14T21:53:11Z)
- Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z)
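The CoDe entry above hinges on routing generation scales to models of different sizes. Below is a hedged sketch of that routing under assumed interfaces; `large_model`, `small_model`, and their `predict_scale` method are hypothetical, not CoDe's actual API. A large drafter handles the coarse scales, where parameter demands are highest, and a small refiner completes the fine ones.

```python
from typing import List, Protocol

class ScalePredictor(Protocol):
    def predict_scale(self, scale: int, context: List) -> object: ...

def collaborative_decode(large_model: ScalePredictor,
                         small_model: ScalePredictor,
                         scales: List[int],
                         draft_scales: int = 4) -> List:
    """Scale-wise decoding split between a large and a small model."""
    tokens: List = []  # token maps accumulated coarse-to-fine
    for i, scale in enumerate(scales):
        # Early (coarse) scales go to the large model; later (fine)
        # scales, which dominate token count, go to the cheap model.
        model = large_model if i < draft_scales else small_model
        tokens.append(model.predict_scale(scale, context=tokens))
    return tokens
```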
- Fast and Memory-Efficient Video Diffusion Using Streamlined Inference [41.505829393818274]
Current video diffusion models exhibit demanding computational requirements and high peak memory usage.
We present Streamlined Inference, which leverages the temporal and spatial properties of video diffusion models.
Our approach significantly reduces peak memory and computational overhead, making it feasible to generate high-quality videos on a single consumer GPU.
arXiv Detail & Related papers (2024-11-02T07:52:18Z)
- SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity [15.872209884833977]
We propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation.
SparseTem achieves speedup of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead.
arXiv Detail & Related papers (2024-10-28T07:13:25Z)
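As a rough picture of what "exploiting temporal continuity" can mean in the SparseTem entry above, the sketch below recomputes features only where consecutive frames changed, reusing cached results elsewhere. The threshold, per-pixel masking granularity, and dense fallback are illustrative assumptions rather than SparseTem's actual scheduling.

```python
import torch

def temporally_sparse_features(prev_frame: torch.Tensor,
                               cur_frame: torch.Tensor,
                               cached_feat: torch.Tensor,
                               backbone,
                               thresh: float = 0.05) -> torch.Tensor:
    """Reuse cached features where frame content is unchanged.

    Assumes `backbone` maps (B, C, H, W) frames to same-resolution
    (B, C', H, W) features; a real system would skip computation in
    the unchanged regions instead of computing densely as done here.
    """
    # Per-pixel change magnitude, averaged over color channels.
    diff = (cur_frame - prev_frame).abs().mean(dim=1, keepdim=True)
    changed = (diff > thresh).float()      # (B, 1, H, W) mask
    fresh = backbone(cur_frame)            # dense compute, for clarity
    return changed * fresh + (1.0 - changed) * cached_feat
```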
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation [50.42746357450949]
We develop deep context fusion, which propagates context information from low-scale to high-scale patches in a hierarchical manner.
We also propose adaptive computation, which allocates more network capacity and computation towards coarse image details.
The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation.
arXiv Detail & Related papers (2024-06-12T01:12:53Z)
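Deep context fusion, as summarized in the entry above, injects low-scale features into high-scale patch features. A minimal sketch of one fusion step follows; the bilinear upsampling and additive fusion are assumptions chosen for clarity, not necessarily the paper's exact operator.

```python
import torch
import torch.nn.functional as F

def fuse_context(patch_feat: torch.Tensor,
                 low_scale_feat: torch.Tensor) -> torch.Tensor:
    """Add upsampled low-resolution context to a high-resolution patch.

    patch_feat:     (B, C, h, w) features of one high-res patch.
    low_scale_feat: (B, C, H, W) features of the full low-res frame.
    """
    # Bring the global context to the patch's spatial size and fuse.
    context = F.interpolate(low_scale_feat, size=patch_feat.shape[-2:],
                            mode="bilinear", align_corners=False)
    return patch_feat + context
```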
We propose a slimmable video codec (SlimVC) by integrating a slimmable temporal entropy model in a slimmable autoencoder.
Despite a significantly more complex architecture, we show that slimming remains a powerful mechanism to control rate, memory footprint, computational cost and latency.
arXiv Detail & Related papers (2022-05-13T16:37:27Z)
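The "slimmable" mechanism in the last entry, one network whose active width can be chosen at run time to trade rate, memory, and latency, can be illustrated with a toy layer. The class below is a generic slimmable-layer sketch, not SlimVC's architecture; the width fractions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    """One weight matrix from which narrower sub-layers are sliced."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor, width: float = 1.0) -> torch.Tensor:
        # Keep only the first `width` fraction of output channels, and
        # as many input channels as the (possibly slimmed) input provides.
        out_c = max(1, int(self.weight.shape[0] * width))
        in_c = x.shape[-1]
        return F.linear(x, self.weight[:out_c, :in_c], self.bias[:out_c])
```

At inference the same parameters serve every operating point: `layer(x, width=0.5)` runs a half-width model, `layer(x, width=1.0)` the full one.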
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.