Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
- URL: http://arxiv.org/abs/2311.10709v2
- Date: Fri, 2 Aug 2024 18:55:25 GMT
- Title: Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
- Authors: Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra,
- Abstract summary: We present Emu Video, a text-to-video generation model that factorizes the generation into two steps.
We identify critical design decisions--adjusted noise schedules for diffusion, and multi-stage training.
In human evaluations, our generated videos are strongly preferred in quality compared to all prior work.
- Score: 59.01091079005586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions--adjusted noise schedules for diffusion, and multi-stage training that enable us to directly generate high quality and high resolution videos, without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality compared to all prior work--81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.
Related papers
- Movie Gen: A Cast of Media Foundation Models [133.41504332082667]
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio.
We show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image.
arXiv Detail & Related papers (2024-10-17T16:22:46Z) - Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video
Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z) - DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose a high-fidelity image-to-video generation method by devising a frame retention branch based on a pre-trained video diffusion model, named DreamVideo.
Our model has a powerful image retention ability and delivers the best results in UCF101 compared to other image-to-video models to our best knowledge.
arXiv Detail & Related papers (2023-12-05T03:16:31Z) - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables text-to-image diffusion model Learn A specific Motion Pattern with 816 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z) - Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [52.93036326078229]
Off-the-shelf billion-scale datasets for image generation are available, but collecting similar video data of the same scale is still challenging.
In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task.
Our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks.
arXiv Detail & Related papers (2023-05-17T17:59:16Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.