Photorealistic Video Generation with Diffusion Models
- URL: http://arxiv.org/abs/2312.06662v1
- Date: Mon, 11 Dec 2023 18:59:57 GMT
- Title: Photorealistic Video Generation with Diffusion Models
- Authors: Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei,
Irfan Essa, Lu Jiang, José Lezama
- Abstract summary: W.A.L.T. is a transformer-based approach for video generation via diffusion modeling.
We use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities.
We also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of $512 \times 896$ resolution at $8$ frames per second.
- Score: 44.95407324724976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present W.A.L.T, a transformer-based approach for photorealistic video
generation via diffusion modeling. Our approach has two key design decisions.
First, we use a causal encoder to jointly compress images and videos within a
unified latent space, enabling training and generation across modalities.
Second, for memory and training efficiency, we use a window attention
architecture tailored for joint spatial and spatiotemporal generative modeling.
Taken together, these design decisions enable us to achieve state-of-the-art
performance on established video (UCF-101 and Kinetics-600) and image
(ImageNet) generation benchmarks without using classifier-free guidance.
Finally, we also train a cascade of three models for the task of text-to-video
generation consisting of a base latent video diffusion model, and two video
super-resolution diffusion models to generate videos of $512 \times 896$
resolution at $8$ frames per second.
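The abstract itself contains no code; the PyTorch sketch below is only a rough illustration of the second design decision (attention restricted to windows for joint spatial and spatiotemporal modeling). All shapes, window sizes, and module names are assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' code) of alternating window attention over
# video latents shaped (batch, frames, height, width, channels).
import torch
import torch.nn as nn


class WindowAttentionBlock(nn.Module):
    """Self-attention restricted to non-overlapping windows of latent tokens."""

    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.window = window  # (frames, height, width) per window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        b, t, h, w, c = x.shape
        wt, wh, ww = self.window
        # Partition the latent grid into (wt, wh, ww) windows and attend inside each.
        x = x.view(b, t // wt, wt, h // wh, wh, w // ww, ww, c)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, c)
        y = self.norm(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        x = x + y
        # Undo the window partition.
        x = x.view(b, t // wt, h // wh, w // ww, wt, wh, ww, c)
        x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(b, t, h, w, c)
        return x


# Alternate frame-local (spatial) windows with 3D (spatiotemporal) windows.
dim, heads = 128, 4
blocks = nn.ModuleList([
    WindowAttentionBlock(dim, heads, window=(1, 4, 4)),  # spatial window
    WindowAttentionBlock(dim, heads, window=(4, 4, 4)),  # spatiotemporal window
])

latents = torch.randn(2, 8, 16, 16, dim)  # e.g. 8 latent frames of 16x16 tokens
for blk in blocks:
    latents = blk(latents)
print(latents.shape)  # torch.Size([2, 8, 16, 16, 128])
```
Restricting attention to windows keeps the attention cost roughly linear in the number of latent tokens, which matches the stated motivation of memory and training efficiency.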
Related papers
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free method for adapting generative video models to high frame rate generation in a plug-and-play manner.
We transform a video model into a self-cascaded video diffusion model with designed hidden state correction modules.
Our training-free method is comparable even to trained models backed by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
- SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model [41.825824810180215]
This paper presents SNED, a superposition network architecture search method for efficient video diffusion models.
Our method employs a supernet training paradigm that targets various model cost and resolution options.
Our framework consistently produces comparable results across different model options with high efficiency.
arXiv Detail & Related papers (2024-05-31T21:12:30Z)
- Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition [124.41196697408627]
We propose the content-motion latent diffusion model (CMD), a novel, efficient extension of pretrained image diffusion models for video generation.
CMD encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation.
We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model.
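As a rough illustration of this decomposition (a sketch under assumptions, not the CMD implementation: the learned frame-mixing weights and the tiny motion encoder are invented here), a clip can be split into a single content frame plus a compact motion vector:

```python
# Illustrative content/motion split for a clip of shape (batch, channels, frames, H, W).
import torch
import torch.nn as nn


class ContentMotionEncoder(nn.Module):
    def __init__(self, num_frames, motion_dim=64):
        super().__init__()
        # Per-frame mixing weights that define the content frame.
        self.frame_logits = nn.Parameter(torch.zeros(num_frames))
        # A small encoder mapping the whole clip to a low-dimensional motion vector.
        self.motion_encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, motion_dim),
        )

    def forward(self, video):
        weights = self.frame_logits.softmax(dim=0).view(1, 1, -1, 1, 1)
        content_frame = (video * weights).sum(dim=2)  # (batch, 3, H, W), image-like
        motion_latent = self.motion_encoder(video)    # (batch, motion_dim)
        return content_frame, motion_latent


enc = ContentMotionEncoder(num_frames=16)
clip = torch.randn(2, 3, 16, 64, 64)
content, motion = enc(clip)
print(content.shape, motion.shape)  # torch.Size([2, 3, 64, 64]) torch.Size([2, 64])
```
The image-like content frame is what lets a pretrained image diffusion model be reused, while the low-dimensional motion latent only needs a lightweight diffusion model.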
arXiv Detail & Related papers (2024-03-21T05:48:48Z)
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free general-purpose video synthesis framework, coined BIVDiff, via bridging specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion models.
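A schematic of that three-step flow is sketched below; the callables, the identity stand-in for DDIM inversion, and the mixing ratio are placeholders for illustration, not the BIVDiff API:

```python
# Hypothetical orchestration of the BIVDiff-style pipeline described above.
import torch


def bridged_video_synthesis(prompt, image_model, ddim_invert, video_model,
                            mix_ratio=0.5, num_frames=16):
    # 1) Frame-wise generation with an off-the-shelf image diffusion model.
    frame_latents = torch.stack(
        [image_model(prompt, frame_index=i) for i in range(num_frames)], dim=1
    )  # (batch, frames, c, h, w)

    # 2) Mixed Inversion: blend inverted latents with fresh Gaussian noise so the
    #    video model receives inputs matching its expected noise distribution.
    inverted = ddim_invert(frame_latents)
    mixed = mix_ratio * inverted + (1.0 - mix_ratio) * torch.randn_like(inverted)

    # 3) Denoise with a text-to-video foundation model for temporal consistency.
    return video_model(mixed, prompt=prompt)


# Toy stand-ins just to exercise the control flow.
image_model = lambda prompt, frame_index: torch.randn(1, 4, 32, 32)
ddim_invert = lambda z: z  # identity placeholder for DDIM inversion
video_model = lambda z, prompt: z
out = bridged_video_synthesis("a cat surfing", image_model, ddim_invert, video_model)
print(out.shape)  # torch.Size([1, 16, 4, 32, 32])
```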
arXiv Detail & Related papers (2023-12-05T14:56:55Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [71.11425812806431]
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands.
Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task.
We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
arXiv Detail & Related papers (2023-04-18T08:30:32Z)
- Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
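As a loose, assumption-heavy illustration of working in a projected latent space (not the PVDM code), a video latent can be pooled onto image-like planes, which is the kind of low-dimensional representation a 2D diffusion model can be trained on cheaply:

```python
# Project a video latent (batch, frames, H, W, channels) onto three 2D planes.
import torch

video_latent = torch.randn(2, 8, 16, 16, 32)

plane_hw = video_latent.mean(dim=1)  # pool over time   -> (batch, H, W, C)
plane_tw = video_latent.mean(dim=2)  # pool over height -> (batch, T, W, C)
plane_th = video_latent.mean(dim=3)  # pool over width  -> (batch, T, H, C)

print(plane_hw.shape, plane_tw.shape, plane_th.shape)
```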
arXiv Detail & Related papers (2023-02-15T14:22:34Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
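The cascade idea, also used by W.A.L.T above, can be sketched as chaining a base generator with spatial and temporal super-resolution stages; the placeholder functions below only mimic the data flow and stand in for the actual diffusion stages:

```python
# Toy cascade: base generation -> spatial upsampling -> temporal upsampling.
import torch
import torch.nn.functional as F


def base_model(prompt):  # placeholder: low-resolution, low-frame-rate video
    return torch.randn(1, 8, 3, 40, 24)  # (batch, frames, channels, H, W)


def spatial_sr(video):  # placeholder for a spatial super-resolution stage
    b, t, c, h, w = video.shape
    up = F.interpolate(video.reshape(b * t, c, h, w), scale_factor=2)
    return up.reshape(b, t, c, 2 * h, 2 * w)


def temporal_sr(video):  # placeholder for a frame-rate (temporal) SR stage
    return video.repeat_interleave(2, dim=1)


video = temporal_sr(spatial_sr(base_model("a dog playing in the snow")))
print(video.shape)  # torch.Size([1, 16, 3, 80, 48])
```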
arXiv Detail & Related papers (2022-10-05T14:41:38Z)