Align your Latents: High-Resolution Video Synthesis with Latent
Diffusion Models
- URL: http://arxiv.org/abs/2304.08818v2
- Date: Thu, 28 Dec 2023 03:31:59 GMT
- Title: Align your Latents: High-Resolution Video Synthesis with Latent
Diffusion Models
- Authors: Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook
Kim, Sanja Fidler, Karsten Kreis
- Abstract summary: Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands.
Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task.
We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
- Score: 71.11425812806431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Latent Diffusion Models (LDMs) enable high-quality image synthesis while
avoiding excessive compute demands by training a diffusion model in a
compressed lower-dimensional latent space. Here, we apply the LDM paradigm to
high-resolution video generation, a particularly resource-intensive task. We
first pre-train an LDM on images only; then, we turn the image generator into a
video generator by introducing a temporal dimension to the latent space
diffusion model and fine-tuning on encoded image sequences, i.e., videos.
Similarly, we temporally align diffusion model upsamplers, turning them into
temporally consistent video super resolution models. We focus on two relevant
real-world applications: Simulation of in-the-wild driving data and creative
content creation with text-to-video modeling. In particular, we validate our
Video LDM on real driving videos of resolution 512 x 1024, achieving
state-of-the-art performance. Furthermore, our approach can easily leverage
off-the-shelf pre-trained image LDMs, as we only need to train a temporal
alignment model in that case. Doing so, we turn the publicly available,
state-of-the-art text-to-image LDM Stable Diffusion into an efficient and
expressive text-to-video model with resolution up to 1280 x 2048. We show that
the temporal layers trained in this way generalize to different fine-tuned
text-to-image LDMs. Utilizing this property, we show the first results for
personalized text-to-video generation, opening exciting directions for future
content creation. Project page:
https://research.nvidia.com/labs/toronto-ai/VideoLDM/
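The core recipe described above (pre-trained spatial layers are kept frozen while newly added temporal layers learn to align frames, with a learned factor blending the two paths) can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under simplifying assumptions, not the authors' implementation: the class names, the sigmoid-bounded mixing scalar alpha, and the single temporal self-attention block per stage are choices made here for brevity.

```python
import torch
import torch.nn as nn

class TemporalAttentionBlock(nn.Module):
    """Self-attention over the frame axis, applied independently at every spatial location."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Learned blend between spatial and temporal paths; initialized so the block
        # is close to an identity and does not disturb the image prior at the start.
        self.alpha = nn.Parameter(torch.tensor(5.0))

    def forward(self, x, num_frames):
        # x: (B*T, C, H, W) -- frames are stacked into the batch, as the image model expects.
        bt, c, h, w = x.shape
        b = bt // num_frames
        z = x.reshape(b, num_frames, c, h, w).permute(0, 3, 4, 1, 2)   # (B, H, W, T, C)
        z = z.reshape(b * h * w, num_frames, c)
        z = self.norm(z)
        z, _ = self.attn(z, z, z)                                      # attend across frames
        z = z.reshape(b, h, w, num_frames, c).permute(0, 3, 4, 1, 2).reshape(bt, c, h, w)
        a = torch.sigmoid(self.alpha)
        return a * x + (1.0 - a) * z

class VideoLDMBlock(nn.Module):
    """A frozen, pre-trained spatial block followed by a trainable temporal block."""
    def __init__(self, spatial_block, channels):
        super().__init__()
        self.spatial = spatial_block
        for p in self.spatial.parameters():      # the image prior stays untouched
            p.requires_grad_(False)
        self.temporal = TemporalAttentionBlock(channels)

    def forward(self, x, num_frames):
        x = self.spatial(x)                      # each frame is processed as an image
        return self.temporal(x, num_frames)      # frames are then aligned across time
```

Because only the temporal blocks (and the mixing scalars) receive gradients during video fine-tuning, the same temporal layers can later be paired with other fine-tuned text-to-image LDMs, which is the property the abstract exploits for personalized text-to-video generation.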
Related papers
- JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation [6.463753697299011]
We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality temporally coherent videos.
Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
arXiv Detail & Related papers (2024-09-21T13:59:50Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition [124.41196697408627]
We propose the content-motion latent diffusion model (CMD), a novel, efficient extension of pretrained image diffusion models for video generation.
CMD encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation.
We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model (an illustrative sketch of this decomposition appears after this list).
arXiv Detail & Related papers (2024-03-21T05:48:48Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- LLM-grounded Video Diffusion Models [57.23066793349706]
Video diffusion models have emerged as a promising tool for neural video generation.
Current models, however, struggle with complex prompts and often generate restricted or incorrect motion.
We introduce LLM-grounded Video Diffusion (LVD).
Our results demonstrate that LVD significantly outperforms its base video diffusion model.
arXiv Detail & Related papers (2023-09-29T17:54:46Z)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z)
- Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
arXiv Detail & Related papers (2023-02-15T14:22:34Z)
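As referenced in the CMD entry above, the content/motion factorization can be illustrated with a short sketch. The code below is purely illustrative and fills in details the summary does not state (softmax-weighted frame averaging for the content frame, and a strided 3D convolution plus pooling as the motion encoder); it is not the CMD architecture, just one way such a decomposition could look.

```python
import torch
import torch.nn as nn

class ContentMotionDecomposition(nn.Module):
    """Illustrative split of video latents into a content frame and a low-dimensional
    motion latent. The specific choices here (softmax frame weights, strided Conv3d
    encoder) are assumptions, not details taken from the CMD paper."""
    def __init__(self, channels, num_frames, motion_dim=16):
        super().__init__()
        # Learned per-frame weights; softmax keeps the content frame image-like in scale.
        self.frame_weights = nn.Parameter(torch.zeros(num_frames))
        # Tiny motion encoder: compress (C, T, H, W) into a much smaller motion latent.
        self.motion_encoder = nn.Sequential(
            nn.Conv3d(channels, motion_dim, kernel_size=3, stride=(1, 4, 4), padding=1),
            nn.SiLU(),
            nn.AdaptiveAvgPool3d((num_frames, 8, 8)),
        )

    def forward(self, video_latents):
        # video_latents: (B, C, T, H, W), e.g. from a frame-wise image autoencoder.
        w = torch.softmax(self.frame_weights, dim=0)                    # (T,)
        content_frame = torch.einsum('bcthw,t->bchw', video_latents, w)
        motion_latent = self.motion_encoder(video_latents)              # (B, motion_dim, T, 8, 8)
        return content_frame, motion_latent
```

In CMD itself, the content frame is then denoised with a fine-tuned image diffusion model while a separate lightweight diffusion model handles the motion latent, so the expensive pre-trained image prior is reused rather than retrained.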
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.