PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation
- URL: http://arxiv.org/abs/2507.16116v1
- Date: Tue, 22 Jul 2025 00:09:37 GMT
- Title: PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation
- Authors: Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel
- Abstract summary: Pusa is a groundbreaking paradigm that enables fine-grained temporal control within a unified video diffusion framework. We set a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32%. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis.
- Score: 18.2095668161519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Moreover, VTA is a non-destructive adaptation: it fully preserves the capabilities of the base model. By fine-tuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency -- surpassing the performance of Wan-I2V-14B with $\leq$ 1/200 of the training cost (\$500 vs. $\geq$ \$100,000) and $\leq$ 1/2500 of the dataset size (4K vs. $\geq$ 10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32\% (vs. 86.86\% of Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension -- all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen
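The core mechanism is easy to state in code: the only structural change is that the diffusion timestep becomes a per-frame vector rather than a single scalar, so each frame can sit at its own noise level. Below is a minimal, hypothetical PyTorch sketch of that idea; `VectorizedTimestepEmbedding`, the embedding dimension, and the I2V usage pattern are illustrative assumptions, not the released Pusa implementation.

```python
import math
import torch
import torch.nn as nn

class VectorizedTimestepEmbedding(nn.Module):
    """Embed one diffusion timestep PER FRAME instead of one scalar per video.

    Illustrative sketch: relative to a conventional diffusion model, the only
    change is that `t` has shape (batch, frames) rather than (batch,).
    """
    def __init__(self, dim: int = 256):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B, F) -- an independent noise level for every frame.
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        angles = t[..., None].float() * freqs                   # (B, F, half)
        emb = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (B, F, dim)
        return self.mlp(emb)                                    # per-frame conditioning

# Image-to-video falls out as a timestep pattern: the conditioning frame is
# kept clean (t = 0) while all other frames carry noise.
B, F = 2, 16
t = torch.full((B, F), 999)
t[:, 0] = 0                     # frame 0 is the clean input image
print(VectorizedTimestepEmbedding()(t).shape)  # torch.Size([2, 16, 256])
```

Under this view, the zero-shot tasks the abstract lists plausibly reduce to timestep patterns alone: start-end frame generation pins t = 0 at both ends of the vector, and video extension pins a prefix of clean frames.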
Related papers
- Helios: Real-Time Long Video Generation Model [33.34372252025333]
Helios is a 14B autoregressive diffusion model with a unified input representation that supports T2V, I2V, and V2V tasks. Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
arXiv Detail & Related papers (2026-03-04T18:45:21Z)
- Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation [76.04880323498598]
We introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains.
arXiv Detail & Related papers (2025-12-12T18:56:35Z)
- GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers [5.2424169748898555]
GalaxyDiT is a training-free method that accelerates video generation through guidance alignment and systematic proxy selection for reuse metrics. It achieves $1.87\times$ and $2.37\times$ speedups on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, the approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
arXiv Detail & Related papers (2025-12-03T05:08:18Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [70.4360995984905]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must, at inference, generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
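The gap this summary describes can be illustrated with a toy training step: instead of teacher-forcing on ground-truth frames, the model rolls out on its own predictions and the loss is applied to that rollout. Everything below, including the `model(context) -> next_frame` interface and the loss, is a hypothetical stand-in, not the paper's actual training recipe.

```python
import torch

def self_forcing_step(model, loss_fn, first_frame, num_frames):
    # Condition each step on the model's OWN previous outputs, exactly as at
    # inference time, so training no longer assumes ground-truth context.
    frames = [first_frame]
    for _ in range(num_frames - 1):
        context = torch.stack(frames, dim=1)       # (B, t, ...)
        frames.append(model(context))              # next frame from own rollout
    return loss_fn(torch.stack(frames, dim=1))     # loss on the full rollout

# Toy usage with a stand-in next-frame predictor over 8-dim "frames".
class TinyPredictor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)
    def forward(self, context):
        return self.proj(context[:, -1])           # predict from the last frame

loss = self_forcing_step(TinyPredictor(), lambda v: v.pow(2).mean(),
                         torch.randn(2, 8), num_frames=5)
loss.backward()
```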
- Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers [22.349130691342687]
Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation.
arXiv Detail & Related papers (2025-06-05T14:41:38Z)
- Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models [34.131515004434846]
We introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient approach for adapting pretrained video diffusion models to conditional generation tasks. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B.
arXiv Detail & Related papers (2025-06-01T12:57:43Z)
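Since the summary stresses that no architectural changes are needed, the conditioning presumably rides along the model's existing temporal axis. A hedged sketch of that idea follows: condition frames are concatenated ahead of the target clip in time, so the pretrained backbone treats them as ordinary context. The tensor layout and function name are illustrative assumptions.

```python
import torch

def temporal_in_context_input(cond_frames: torch.Tensor,
                              target_frames: torch.Tensor,
                              noise_scale: float) -> torch.Tensor:
    """Concatenate clean condition frames ahead of the (noised) target clip
    along the time axis; the backbone is untouched. Layout: (B, T, C, H, W)."""
    noised = target_frames + noise_scale * torch.randn_like(target_frames)
    return torch.cat([cond_frames, noised], dim=1)

x = temporal_in_context_input(torch.randn(1, 2, 3, 32, 32),   # 2 condition frames
                              torch.randn(1, 8, 3, 32, 32),   # 8 target frames
                              noise_scale=0.5)
print(x.shape)  # torch.Size([1, 10, 3, 32, 32])
```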
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [52.32078428442281]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
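One standard ingredient of a bidirectional-to-autoregressive adaptation is replacing full attention with a block-causal mask over frames, so each frame's tokens attend only to their own frame and earlier ones. The sketch below is generic and not necessarily the paper's exact recipe.

```python
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): each token sees all tokens
    of its own frame and of earlier frames, never future frames."""
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())
# tensor([[1, 1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1]])
```

Such a mask can be passed directly as the `attn_mask` of `torch.nn.functional.scaled_dot_product_attention`, which also makes frame-by-frame generation with KV caching possible.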
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods made significant breakthroughs by adapting Stable Diffusion to this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
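Read literally, the recipe is a GAN-style fine-tune of a pretrained generator so that a single forward pass yields a video. Below is a generic sketch with a hinge objective; the `generator`/`discriminator` interfaces and the loss choice are assumptions, not SF-V's actual formulation.

```python
import torch
import torch.nn.functional as F

def adversarial_finetune_step(generator, discriminator, noise, real_video):
    fake = generator(noise)                    # ONE forward pass, no iterative sampling
    d_real = discriminator(real_video)
    d_fake = discriminator(fake.detach())
    # Hinge losses: discriminator separates real from generated videos,
    # generator is fine-tuned to fool it.
    d_loss = F.relu(1 - d_real).mean() + F.relu(1 + d_fake).mean()
    g_loss = -discriminator(fake).mean()
    return d_loss, g_loss
```

In practice the two losses would be stepped with separate optimizers, alternating discriminator and generator updates as in standard adversarial training.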
- SimDA: Simple Diffusion Adapter for Efficient Video Generation [102.90154301044095]
We propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
In addition to T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning.
arXiv Detail & Related papers (2023-08-18T17:58:44Z)
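The parameter accounting (24M trainable out of 1.1B) implies the usual adapter pattern: freeze the backbone and train small residual modules. A generic bottleneck-adapter sketch under that assumption; the `Adapter` class and freezing helper are illustrative, not SimDA's actual modules.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Generic residual bottleneck adapter (illustrative, not SimDA's design)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # zero-init so the adapter starts as
        nn.init.zeros_(self.up.bias)     # an identity map over the base model

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def train_adapters_only(model: nn.Module) -> None:
    # Freeze the T2I backbone; leave only adapter parameters trainable.
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable:,} of {total:,}")
```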
- Temporally-Adaptive Models for Efficient Video Understanding [36.413570840293005]
This work shows that adaptive weight calibration along the temporal dimension is an efficient way to facilitate modeling complex temporal dynamics in videos.
Specifically, TAdaConv empowers spatial convolutions with temporal modeling abilities by calibrating the convolution weights for each frame according to its local and global temporal context.
Compared to existing operations for temporal modeling, TAdaConv is more efficient because it operates on the convolution kernels, whose dimensionality is an order of magnitude smaller than the features' spatial resolution, rather than on the features themselves.
arXiv Detail & Related papers (2023-08-10T17:35:47Z)
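A minimal sketch of weight calibration along time: a shared 2D kernel is rescaled per frame (and per sample) by factors predicted from pooled temporal context. The calibration head and the grouped-conv trick for per-sample kernels below are simplifications of the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAdaConvSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.02)
        # Tiny calibration head: 1D conv over time on spatially pooled features.
        self.calib = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        ctx = x.mean(dim=(3, 4))                     # (B, C, T) temporal context
        alpha = 1 + torch.tanh(self.calib(ctx))      # per-frame, per-channel factors
        out = []
        for t in range(T):
            # Calibrate the SHARED kernel for frame t (cheap: kernels are far
            # smaller than feature maps), then apply it per sample via groups.
            w = self.weight[None] * alpha[:, :, t][:, :, None, None, None]
            y = F.conv2d(x[:, :, t].reshape(1, B * C, H, W),
                         w.reshape(B * C, C, 3, 3), padding=1, groups=B)
            out.append(y.reshape(B, C, H, W))
        return torch.stack(out, dim=2)               # (B, C, T, H, W)

y = TAdaConvSketch(8)(torch.randn(2, 8, 4, 16, 16))
print(y.shape)  # torch.Size([2, 8, 4, 16, 16])
```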
- VA-RED$^2$: Video Adaptive Redundancy Reduction [64.75692128294175]
We present a redundancy reduction framework, VA-RED$^2$, which is input-dependent.
We learn the adaptive policy jointly with the network weights in a differentiable way with a shared-weight mechanism.
Our framework achieves a 20%-40% reduction in computation (FLOPs) when compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-02-15T22:57:52Z)
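The phrase "learn the adaptive policy jointly with the network weights in a differentiable way" typically points to a Gumbel-Softmax relaxation of discrete compute/reduce decisions. The sketch below illustrates that general pattern with placeholder branches; it is not the paper's exact policy or shared-weight design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveComputePolicy(nn.Module):
    """Input-dependent keep-vs-reduce gate, trained end-to-end via
    Gumbel-Softmax (illustrative stand-in for VA-RED^2's policy)."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 2)        # logits: [full compute, reduced]

    def forward(self, tokens, full_branch, cheap_branch):
        logits = self.head(tokens.mean(dim=1))                # per-input decision
        gate = F.gumbel_softmax(logits, tau=1.0, hard=True)   # one-hot, differentiable
        # Mix branch outputs during training; at deployment the un-selected
        # branch is skipped entirely, which is where the FLOPs savings come from.
        return (gate[:, 0, None, None] * full_branch(tokens)
                + gate[:, 1, None, None] * cheap_branch(tokens))

policy = AdaptiveComputePolicy(dim=32)
x = torch.randn(4, 10, 32)
y = policy(x, nn.Linear(32, 32), lambda t: t)   # placeholder branches
print(y.shape)  # torch.Size([4, 10, 32])
```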
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.