SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device
- URL: http://arxiv.org/abs/2412.10494v1
- Date: Fri, 13 Dec 2024 18:59:56 GMT
- Title: SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device
- Authors: Yushu Wu, Zhixing Zhang, Yanyu Li, Yanwu Xu, Anil Kag, Yang Sui, Huseyin Coskun, Ke Ma, Aleksei Lebedev, Ju Hu, Dimitris Metaxas, Yanzhi Wang, Sergey Tulyakov, Jian Ren,
- Abstract summary: We propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users.
Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds.
- Score: 61.42406720183769
- License:
- Abstract: We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. From the network architecture scope, we initialize from a compact image backbone and search out the design and arrangement of temporal layers to maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate the generation by magnitudes while delivering on-par quality.
Related papers
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [52.32078428442281]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies.
We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly.
Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z) - Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache)
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
arXiv Detail & Related papers (2024-11-04T18:59:44Z) - Video-Infinity: Distributed Long Video Generation [73.30145218077074]
Diffusion models have recently achieved remarkable results for video generation.
Our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than the prior methods.
arXiv Detail & Related papers (2024-06-24T01:56:12Z) - AtomoVideo: High Fidelity Image-to-Video Generation [25.01443995920118]
We propose a high fidelity framework for image-to-video generation, named AtomoVideo.
Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image.
Our architecture extends flexibly to the video frame prediction task, enabling long sequence prediction through iterative generation.
arXiv Detail & Related papers (2024-03-04T07:41:50Z) - Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video
Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z) - Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z) - Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.