Magic 1-For-1: Generating One Minute Video Clips within One Minute
- URL: http://arxiv.org/abs/2502.07701v3
- Date: Mon, 17 Feb 2025 02:02:08 GMT
- Title: Magic 1-For-1: Generating One Minute Video Clips within One Minute
- Authors: Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou
- Abstract summary: We present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency.
By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics.
- Score: 53.07214657235465
- License:
- Abstract: In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) model convergence speedup through multi-modal prior condition injection; 2) inference latency speedup through adversarial step distillation; and 3) inference memory reduction through parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second per second of generated video on average. We conduct a series of preliminary explorations to identify the optimal tradeoff between computational cost and video quality during diffusion step distillation, and we hope this can serve as a good foundation model for open-source exploration. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
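As a rough illustration of the factorization and the test-time sliding window described above, the following minimal Python/NumPy sketch chains one text-to-image call into repeated image-to-video calls, seeding each new 5-second chunk with the tail frame of the previous one. The t2i and i2v functions, the frame rate, and the chunk length are hypothetical placeholders, not the released Magic141 API.

```python
import numpy as np

FPS = 24
CHUNK_SECONDS = 5  # the distilled I2V stage is assumed to produce ~5-second clips


def t2i(prompt: str) -> np.ndarray:
    """Placeholder text-to-image stage: returns a single RGB frame (H, W, 3)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((256, 256, 3)).astype(np.float32)


def i2v(first_frame: np.ndarray, prompt: str, num_frames: int) -> np.ndarray:
    """Placeholder image-to-video stage: returns a clip of shape (T, H, W, 3).

    A real distilled model would run only a handful of denoising steps here.
    """
    return np.repeat(first_frame[None], num_frames, axis=0)


def generate_long_video(prompt: str, total_seconds: int = 60) -> np.ndarray:
    """Factorized generation: T2I for the keyframe, then chunked I2V with a
    test-time sliding window that seeds each chunk with the previous tail frame."""
    frames: list[np.ndarray] = []
    keyframe = t2i(prompt)                                 # stage 1: text -> image
    while len(frames) < total_seconds * FPS:
        clip = i2v(keyframe, prompt, CHUNK_SECONDS * FPS)  # stage 2: image -> video
        frames.extend(clip)
        keyframe = clip[-1]                                # sliding window: condition on the tail
    return np.stack(frames[: total_seconds * FPS])


video = generate_long_video("a red fox running through fresh snow")
print(video.shape)  # (1440, 256, 256, 3) for 60 s at 24 fps
```

The sliding window is what turns a fixed-length image-to-video generator into a minute-long one without retraining; the latency claim rests on each chunk being produced faster than its own duration by the distilled model.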
Related papers
- Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile [28.913893318345384]
Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps.
This paper addresses the inefficiency from two aspects: 1) pruning the 3D full attention based on redundancy within video data, and 2) shortening the sampling process by adopting existing multi-step consistency distillation (a generic sketch of such attention pruning follows this entry).
arXiv Detail & Related papers (2025-02-10T05:00:56Z)
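The sketch below only illustrates, in generic form, what pruning full 3D spatiotemporal attention down to a sparse pattern can look like: each token attends within its own frame and to a couple of assumed anchor frames. Efficient-vDiT derives its actual tile pattern from redundancy observed in the attention maps, so the mask here is a stand-in, and a real implementation would use block-sparse kernels instead of materializing the full score matrix as this toy does.

```python
import numpy as np


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with a boolean mask (True = keep)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


T, S, d = 16, 64, 32            # frames, tokens per frame, channel dim
rng = np.random.default_rng(0)
x = rng.standard_normal((T * S, d)).astype(np.float32)

# Full 3D attention: every token attends to every token in every frame.
full_mask = np.ones((T * S, T * S), dtype=bool)

# Pruned pattern: each token attends only within its own frame and to a few
# "anchor" frames (here the first two), mimicking a tiled sparsification.
frame_id = np.repeat(np.arange(T), S)
anchor = frame_id < 2
pruned_mask = (frame_id[:, None] == frame_id[None, :]) | anchor[None, :]

out_full = masked_attention(x, x, x, full_mask)
out_pruned = masked_attention(x, x, x, pruned_mask)
print(pruned_mask.mean())       # fraction of attention entries actually kept
```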
- Real-time One-Step Diffusion-based Expressive Portrait Videos Generation [85.07446744308247]
We introduce OSA-LCM (One-Step Avatar Latent Consistency Model), paving the way for real-time diffusion-based avatars.
Our method achieves comparable video quality to existing methods but requires only one sampling step, making it more than 10x faster.
arXiv Detail & Related papers (2024-12-18T03:42:42Z)
- REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost.
In this paper, we argue that videos contain much more redundant information than images and thus can be encoded with very few motion latents.
We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024$\times$1024 video clip within 15.5 seconds on a single A100 GPU.
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net, a workhorse behind image generation, scales poorly when generating videos, requiring significant computational overhead.
Our video-first design allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: we randomly drop input video patches and mask out the input text during the post-pretraining procedure (a toy sketch of these two operations follows this entry).
Our method achieves state-of-the-art performance, comparable to that of some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
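The two input-corruption operations named above, random video-patch dropping and text masking, can be sketched as follows. The keep/mask ratios, tensor shapes, and mask-token id are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

MASK_TOKEN_ID = 0  # hypothetical id of the text [MASK] token


def drop_video_patches(patches: np.ndarray, keep_ratio: float = 0.3) -> np.ndarray:
    """Randomly keep a subset of video patch tokens (shape: [num_patches, dim])."""
    num_keep = max(1, int(len(patches) * keep_ratio))
    idx = rng.choice(len(patches), size=num_keep, replace=False)
    return patches[np.sort(idx)]


def mask_text_tokens(token_ids: np.ndarray, mask_ratio: float = 0.5) -> np.ndarray:
    """Randomly replace a fraction of text token ids with the mask id."""
    masked = token_ids.copy()
    hit = rng.random(len(token_ids)) < mask_ratio
    masked[hit] = MASK_TOKEN_ID
    return masked


video_patches = rng.standard_normal((8 * 196, 768)).astype(np.float32)  # 8 frames x 196 patches
text_ids = rng.integers(1, 30000, size=16)

sparse_patches = drop_video_patches(video_patches)
masked_text = mask_text_tokens(text_ids)
print(sparse_patches.shape, (masked_text == MASK_TOKEN_ID).sum())
```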
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
- FrameExit: Conditional Early Exiting for Efficient Video Recognition [11.92976432364216]
We propose a conditional early exiting framework for efficient video recognition.
Our model learns to process fewer frames for simpler videos and more frames for complex ones (a minimal sketch of this early-exit idea follows this entry).
Our method sets a new state of the art for efficient video understanding on the HVU benchmark.
arXiv Detail & Related papers (2021-04-27T18:01:05Z)
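Below is a minimal sketch of conditional early exiting, with a hypothetical per-frame classifier and a confidence-threshold gate standing in for FrameExit's learned gating modules: frames are processed one at a time, and inference stops as soon as the accumulated evidence is deemed sufficient.

```python
import numpy as np

NUM_CLASSES = 10
rng = np.random.default_rng(0)


def frame_logits(frame: np.ndarray) -> np.ndarray:
    """Toy per-frame logits: a fixed class preference plus noise.

    A real backbone would compute these from the frame content; here the frame
    only contributes a uniform offset, which the softmax ignores.
    """
    return 0.3 * np.arange(NUM_CLASSES) + 0.5 * rng.standard_normal(NUM_CLASSES) + frame.mean()


def confident_enough(logits: np.ndarray, threshold: float = 0.9) -> bool:
    """Placeholder exit gate: stop once softmax confidence passes a threshold."""
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs.max()) >= threshold


def classify_with_early_exit(video: np.ndarray) -> tuple[int, int]:
    """Process frames one by one and exit as soon as the gate fires."""
    accumulated = np.zeros(NUM_CLASSES)
    for t, frame in enumerate(video, start=1):
        accumulated += frame_logits(frame)
        if confident_enough(accumulated):
            break  # enough evidence: simple inputs exit after few frames
    return int(accumulated.argmax()), t


video = rng.random((32, 224, 224, 3)).astype(np.float32)
label, frames_used = classify_with_early_exit(video)
print(f"class={label}, frames processed={frames_used}")
```

Easy inputs accumulate decisive logits quickly and exit early; hard or ambiguous inputs keep consuming frames, which is the compute-saving behavior the summary describes.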
- PGT: A Progressive Method for Training Models on Long Videos [45.935259079953255]
The mainstream approach is to split a raw video into clips, which leads to incomplete temporal information flow.
Inspired by natural language processing techniques for long sentences, we propose to treat videos as serial fragments satisfying the Markov property (a toy illustration of this fragment-by-fragment processing follows this entry).
We empirically demonstrate that it yields significant performance improvements on different models and datasets.
arXiv Detail & Related papers (2021-03-21T06:15:20Z)
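PGT's actual model and training details are not given in the summary above, so the following toy sketch only illustrates the stated idea: a long video is processed as serial fragments, with a carried state linking consecutive fragments (the Markov assumption), and a loss is applied per fragment rather than once over the whole video. Every function and dimension here is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAGMENT_LEN = 16
STATE_DIM, FEAT_DIM = 64, 128

# Toy recurrent cell standing in for a video model's temporal module.
W_h = rng.standard_normal((STATE_DIM, STATE_DIM)) * 0.1
W_x = rng.standard_normal((FEAT_DIM, STATE_DIM)) * 0.1


def run_fragment(frames: np.ndarray, state: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Roll a simple recurrence over one fragment, returning outputs and the final state."""
    outputs = []
    for x in frames:
        state = np.tanh(state @ W_h + x @ W_x)
        outputs.append(state)
    return np.stack(outputs), state


long_video = rng.standard_normal((128, FEAT_DIM))  # 128 "frames" of precomputed features
state = np.zeros(STATE_DIM)                        # initial carried (Markov) state

for start in range(0, len(long_video), FRAGMENT_LEN):
    fragment = long_video[start:start + FRAGMENT_LEN]
    outputs, state = run_fragment(fragment, state)
    # In actual progressive training, a loss and a gradient step would be applied
    # here, per fragment, before moving on; the carried `state` is what lets
    # temporal information flow across fragment boundaries.
    loss = float((outputs ** 2).mean())            # placeholder per-fragment loss
    print(f"fragment starting at frame {start}: loss={loss:.3f}")
```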