Imagen Video: High Definition Video Generation with Diffusion Models
- URL: http://arxiv.org/abs/2210.02303v1
- Date: Wed, 5 Oct 2022 14:41:38 GMT
- Title: Imagen Video: High Definition Video Generation with Diffusion Models
- Authors: Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao,
Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J.
Fleet, Tim Salimans
- Abstract summary: Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find that Imagen Video not only generates videos of high fidelity, but also exhibits a high degree of controllability and world knowledge.
- Score: 64.06483414521222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Imagen Video, a text-conditional video generation system based on
a cascade of video diffusion models. Given a text prompt, Imagen Video
generates high definition videos using a base video generation model and a
sequence of interleaved spatial and temporal video super-resolution models. We
describe how we scale up the system as a high definition text-to-video model
including design decisions such as the choice of fully-convolutional temporal
and spatial super-resolution models at certain resolutions, and the choice of
the v-parameterization of diffusion models. In addition, we confirm and
transfer findings from previous work on diffusion-based image generation to the
video generation setting. Finally, we apply progressive distillation to our
video models with classifier-free guidance for fast, high quality sampling. We
find Imagen Video not only capable of generating videos of high fidelity, but
also having a high degree of controllability and world knowledge, including the
ability to generate diverse videos and text animations in various artistic
styles and with 3D object understanding. See
https://imagen.research.google/video/ for samples.
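Code sketch (not part of the paper): the abstract mentions the v-parameterization of diffusion models and classifier-free guidance, both of which can be stated concretely. Below is a minimal NumPy sketch, assuming a variance-preserving noise schedule with alpha_t^2 + sigma_t^2 = 1; toy_denoiser and every other name here are hypothetical placeholders, not the (unreleased) Imagen Video implementation.
```python
# Illustrative sketch only: v-parameterization and classifier-free guidance
# as generic diffusion-model components, not Imagen Video's actual code.
import numpy as np

def v_to_x0_eps(z_t, v_pred, alpha_t, sigma_t):
    """Recover x0 and eps estimates from a v-prediction, given z_t = alpha_t*x0 + sigma_t*eps."""
    x0_hat = alpha_t * z_t - sigma_t * v_pred
    eps_hat = sigma_t * z_t + alpha_t * v_pred
    return x0_hat, eps_hat

def classifier_free_guidance(v_cond, v_uncond, guidance_weight):
    """Standard classifier-free guidance: extrapolate away from the unconditional prediction."""
    return v_uncond + guidance_weight * (v_cond - v_uncond)

def toy_denoiser(z_t, t, text_embedding=None):
    """Hypothetical stand-in for a text-conditioned video U-Net that predicts v."""
    if text_embedding is None:
        return 0.1 * z_t
    return 0.1 * z_t + 0.01 * float(text_embedding.mean())

# One guided denoising step on a tiny "video" array (frames, height, width, channels).
z_t = np.random.randn(8, 16, 16, 3)
alpha_t, sigma_t = 0.8, 0.6          # example schedule values with alpha^2 + sigma^2 = 1
text_emb = np.random.randn(64)

v_guided = classifier_free_guidance(
    v_cond=toy_denoiser(z_t, t=0.5, text_embedding=text_emb),
    v_uncond=toy_denoiser(z_t, t=0.5),
    guidance_weight=7.5,
)
x0_hat, eps_hat = v_to_x0_eps(z_t, v_guided, alpha_t, sigma_t)
print(x0_hat.shape, eps_hat.shape)
```
In the system described by the abstract, guided predictions of this kind come from a cascade: a base video model followed by interleaved spatial and temporal super-resolution models.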
Related papers
- Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video (see the space-time down/up-sampling sketch after this list).
arXiv Detail & Related papers (2024-01-23T18:05:25Z)
- DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose DreamVideo, a high-fidelity image-to-video generation method that adds a frame retention branch to a pre-trained video diffusion model.
Our model has strong image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach that can generate high-definition videos with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
- VIDM: Video Implicit Diffusion Models [75.90225524502759]
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse images.
We propose a video generation method based on diffusion models, where the effects of motion are modeled in an implicit condition.
We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization.
arXiv Detail & Related papers (2022-12-01T02:58:46Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
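Code sketch (not part of any paper above): the Lumiere entry describes factorized spatial and temporal down- and up-sampling in a Space-Time U-Net. The NumPy sketch below illustrates only how the two axes are handled separately on a (frames, height, width, channels) array; the function names are illustrative, and the real model uses learned convolutional layers rather than pooling and frame repetition.
```python
# Illustrative sketch of factorized space-time down/up-sampling, not the authors' code.
import numpy as np

def spatial_downsample(video, factor=2):
    """Average-pool over height and width only; the frame count is unchanged."""
    f, h, w, c = video.shape
    video = video[:, : h - h % factor, : w - w % factor, :]
    return video.reshape(f, h // factor, factor, w // factor, factor, c).mean(axis=(2, 4))

def temporal_downsample(video, factor=2):
    """Average groups of adjacent frames; spatial resolution is unchanged."""
    f = video.shape[0] - video.shape[0] % factor
    video = video[:f]
    return video.reshape(f // factor, factor, *video.shape[1:]).mean(axis=1)

def temporal_upsample(video, factor=2):
    """Nearest-neighbour frame repetition, standing in for learned temporal super-resolution."""
    return np.repeat(video, factor, axis=0)

video = np.random.randn(16, 64, 64, 3)                 # 16 frames of 64x64 RGB
coarse = temporal_downsample(spatial_downsample(video))  # coarse space-time representation
restored_fps = temporal_upsample(coarse)                 # back to the original frame rate
print(video.shape, coarse.shape, restored_fps.shape)
```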
This list is automatically generated from the titles and abstracts of the papers on this site.