Training-Free Efficient Video Generation via Dynamic Token Carving
- URL: http://arxiv.org/abs/2505.16864v1
- Date: Thu, 22 May 2025 16:21:32 GMT
- Title: Training-Free Efficient Video Generation via Dynamic Token Carving
- Authors: Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia,
- Abstract summary: Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
- Score: 54.52061549312799
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
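The abstract describes two mechanisms: reordering latent tokens along a 3D space-filling curve so that nearby tokens fall into the same block, and letting each query block attend only to its most relevant key blocks. The sketch below is a minimal illustration of that idea, not Jenga's released implementation; the Morton/Z-order curve, the block size, and the mean-pooled block-relevance score are simplifying assumptions chosen for brevity.

```python
# Illustrative sketch of block-wise "attention carving" over a 3D latent grid.
# Assumptions (not from the paper's code): Morton/Z-order curve, block_size=64,
# block relevance scored by similarity of mean-pooled block features.
import torch
import torch.nn.functional as F


def morton_order(T, H, W, bits=8):
    """Permutation that sorts (t, h, w) latent positions by interleaved-bit
    (Z-order) index, so spatio-temporally nearby tokens share a block."""
    t, h, w = torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    )
    coords = torch.stack([t.flatten(), h.flatten(), w.flatten()], dim=-1)
    codes = torch.zeros(coords.shape[0], dtype=torch.long)
    for b in range(bits):
        for d in range(3):  # interleave bits of (t, h, w)
            codes |= ((coords[:, d] >> b) & 1) << (3 * b + d)
    return torch.argsort(codes)


def block_sparse_attention(q, k, v, block_size=64, keep_blocks=4):
    """Toy block-wise attention: score key blocks against each query block,
    keep the top-k, and run dense attention only on the kept blocks."""
    L, D = q.shape
    nb = L // block_size
    qb = q[: nb * block_size].view(nb, block_size, D)
    kb = k[: nb * block_size].view(nb, block_size, D)
    vb = v[: nb * block_size].view(nb, block_size, D)

    # Block-level relevance from mean-pooled block features.
    scores = qb.mean(1) @ kb.mean(1).T                  # (nb, nb)
    top = scores.topk(keep_blocks, dim=-1).indices      # (nb, keep_blocks)

    out = torch.empty_like(qb)
    for i in range(nb):
        ks = kb[top[i]].reshape(-1, D)                  # gathered key blocks
        vs = vb[top[i]].reshape(-1, D)
        attn = F.softmax(qb[i] @ ks.T / D ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out.view(-1, D)


# Example: an 8x16x16 latent grid of 256-dim tokens, reordered by the 3D curve.
T, H, W, D = 8, 16, 16, 256
perm = morton_order(T, H, W)
x = torch.randn(T * H * W, D)[perm]
y = block_sparse_attention(x, x, x)
print(y.shape)  # torch.Size([2048, 256])
```

In the paper's pipeline this carving is applied inside the DiT's attention layers and combined with progressive resolution (coarse latents in early denoising steps, full resolution later); the snippet only conveys the block-selection logic.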
Related papers
- Taming Diffusion Transformer for Real-Time Mobile Video Generation [72.20660234882594]
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones. We propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms.
arXiv Detail & Related papers (2025-07-17T17:59:10Z) - VMoBA: Mixture-of-Block Attention for Video Diffusion Models [29.183614108287276]
This paper introduces Video Mixture of Block Attention (VMoBA), a novel attention mechanism specifically adapted for Video Diffusion Models (VDMs). Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, VMoBA enhances the original MoBA framework with three key modifications. Experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup.
arXiv Detail & Related papers (2025-06-30T13:52:31Z) - LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models [17.858801012726445]
Diffusion-based models have gained wide adoption in virtual human generation due to their outstanding expressiveness. We present a novel audio-driven portrait video generation framework based on the diffusion model to address these challenges. Our model achieves a maximum of 78 FPS at a resolution of 384x384 and 45 FPS at a resolution of 512x512, with an initial video generation latency of 140 ms and 215 ms, respectively.
arXiv Detail & Related papers (2025-06-06T07:09:07Z) - RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy [10.53687668536011]
RainFusion exploits the inherent sparsity of visual data to accelerate attention computation while preserving video quality. Our proposed RainFusion is a plug-and-play method that can be seamlessly integrated into state-of-the-art 3D-attention video generation models.
arXiv Detail & Related papers (2025-05-27T11:15:02Z) - Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis [50.77548592888096]
Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. Turbo2K is an efficient framework for generating detail-rich 2K videos.
arXiv Detail & Related papers (2025-04-20T03:30:59Z) - Training-free and Adaptive Sparse Attention for Efficient Long Video Generation [31.615453637053793]
Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency. We propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs.
arXiv Detail & Related papers (2025-02-28T14:11:20Z) - DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training [85.04885553561164]
Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. DiTs can consume up to 95% of processing time and demand specialized context parallelism. This paper introduces DSV to accelerate video DiT training by leveraging the dynamic attention sparsity we empirically observe.
arXiv Detail & Related papers (2025-02-11T14:39:59Z) - CascadeV: An Implementation of Wurstchen Architecture for Video Generation [4.086317089863318]
We propose a cascaded latent diffusion model (LDM) that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. Our model can be cascaded with existing T2V models, theoretically enabling a 4$\times$ increase in resolution or frames per second without any fine-tuning.
arXiv Detail & Related papers (2025-01-28T01:14:24Z) - MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion [0.0]
We propose a Multi-Scale Causal (MSC) framework to address these problems. We introduce multiple resolutions in the spatial dimension and high and low frequencies in the temporal dimension to realize efficient attention calculation. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training.
arXiv Detail & Related papers (2024-12-13T03:39:09Z) - Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache).
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
arXiv Detail & Related papers (2024-11-04T18:59:44Z) - Fast and Memory-Efficient Video Diffusion Using Streamlined Inference [41.505829393818274]
Current video diffusion models exhibit demanding computational requirements and high peak memory usage.
We present Streamlined Inference, which leverages the temporal and spatial properties of video diffusion models.
Our approach significantly reduces peak memory and computational overhead, making it feasible to generate high-quality videos on a single consumer GPU.
arXiv Detail & Related papers (2024-11-02T07:52:18Z) - ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation. A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens. An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
arXiv Detail & Related papers (2024-10-27T16:28:28Z) - FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z) - Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)