Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video
Synthesis
- URL: http://arxiv.org/abs/2402.14797v1
- Date: Thu, 22 Feb 2024 18:55:08 GMT
- Title: Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video
Synthesis
- Authors: Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina
Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci,
Jian Ren, Sergey Tulyakov
- Abstract summary: We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
Our proposed transformer-based architecture allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
- Score: 69.83405335645305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary models for generating images show remarkable quality and
versatility. Swayed by these advantages, the research community repurposes them
to generate videos. Since video content is highly redundant, we argue that
naively bringing advances of image models to the video generation domain
reduces motion fidelity, visual quality and impairs scalability. In this work,
we build Snap Video, a video-first model that systematically addresses these
challenges. To do that, we first extend the EDM framework to take into account
spatially and temporally redundant pixels and naturally support video
generation. Second, we show that a U-Net - a workhorse behind image generation
- scales poorly when generating videos, requiring significant computational
overhead. Hence, we propose a new transformer-based architecture that trains
3.31 times faster than U-Nets (and is ~4.5 times faster at inference). This allows us
to efficiently train a text-to-video model with billions of parameters for the
first time, reach state-of-the-art results on a number of benchmarks, and
generate videos with substantially higher quality, temporal consistency, and
motion complexity. The user studies showed that our model was favored by a
large margin over the most recent methods. See our website at
https://snap-research.github.io/snapvideo/.
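
To make the speedup figures in the abstract concrete, the following minimal sketch converts a speedup factor into a reduced cost and a fractional saving. Only the two factors (3.31 for training, roughly 4.5 for inference) come from the abstract; the baseline budgets and the helper names `scaled_cost` and `relative_saving` are hypothetical, chosen purely for illustration, and say nothing about the actual training setup.

```python
# Speedup factors quoted in the abstract.
TRAIN_SPEEDUP = 3.31      # "trains 3.31 times faster than U-Nets"
INFERENCE_SPEEDUP = 4.5   # approximate inference speedup ("~4.5")


def scaled_cost(baseline_cost: float, speedup: float) -> float:
    """Return the cost of the faster model given a baseline cost and a speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_cost / speedup


def relative_saving(speedup: float) -> float:
    """Return the fraction of the baseline cost saved by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Hypothetical U-Net baseline budgets (assumptions, not figures from the paper).
    unet_train_hours = 1000.0
    unet_seconds_per_video = 90.0

    train_hours = scaled_cost(unet_train_hours, TRAIN_SPEEDUP)
    infer_seconds = scaled_cost(unet_seconds_per_video, INFERENCE_SPEEDUP)

    print(f"training:  {train_hours:.1f} h "
          f"({relative_saving(TRAIN_SPEEDUP):.1%} saved)")
    print(f"inference: {infer_seconds:.1f} s "
          f"({relative_saving(INFERENCE_SPEEDUP):.1%} saved)")
```

Under these assumptions, a 3.31x training speedup corresponds to saving about 70% of the baseline compute, since the saving is 1 - 1/3.31.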
Related papers
- AtomoVideo: High Fidelity Image-to-Video Generation [25.01443995920118]
We propose a high fidelity framework for image-to-video generation, named AtomoVideo.
Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image.
Our architecture extends flexibly to the video frame prediction task, enabling long sequence prediction through iterative generation.
arXiv Detail & Related papers (2024-03-04T07:41:50Z)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
- StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 [39.835681276854025]
We think of videos as what they should be, time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator.
We build our model on top of StyleGAN2 and it is just 5% more expensive to train at the same resolution while achieving almost the same image quality.
Our model achieves state-of-the-art results on four modern $256^2$ video synthesis benchmarks and one $1024^2$ resolution one.
arXiv Detail & Related papers (2021-12-29T17:58:29Z)
- A Good Image Generator Is What You Need for High-Resolution Video Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos.
We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator.
We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
arXiv Detail & Related papers (2021-04-30T15:38:41Z)