Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large
Datasets
- URL: http://arxiv.org/abs/2311.15127v1
- Date: Sat, 25 Nov 2023 22:28:38 GMT
- Title: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large
Datasets
- Authors: Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch,
Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam
Letts, Varun Jampani, Robin Rombach
- Abstract summary: We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
- Score: 36.95521842177614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Stable Video Diffusion - a latent video diffusion model for
high-resolution, state-of-the-art text-to-video and image-to-video generation.
Recently, latent diffusion models trained for 2D image synthesis have been
turned into generative video models by inserting temporal layers and finetuning
them on small, high-quality video datasets. However, training methods in the
literature vary widely, and the field has yet to agree on a unified strategy
for curating video data. In this paper, we identify and evaluate three
different stages for successful training of video LDMs: text-to-image
pretraining, video pretraining, and high-quality video finetuning. Furthermore,
we demonstrate the necessity of a well-curated pretraining dataset for
generating high-quality videos and present a systematic curation process to
train a strong base model, including captioning and filtering strategies. We
then explore the impact of finetuning our base model on high-quality data and
train a text-to-video model that is competitive with closed-source video
generation. We also show that our base model provides a powerful motion
representation for downstream tasks such as image-to-video generation and
adaptability to camera motion-specific LoRA modules. Finally, we demonstrate
that our model provides a strong multi-view 3D-prior and can serve as a base to
finetune a multi-view diffusion model that jointly generates multiple views of
objects in a feedforward fashion, outperforming image-based methods at a
fraction of their compute budget. We release code and model weights at
https://github.com/Stability-AI/generative-models .
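As context for the temporal-layer insertion mentioned in the abstract, here is a minimal PyTorch sketch of one common factorized design: a temporal self-attention block that folds spatial positions into the batch axis and attends across frames, with a zero-initialized output projection so the pretrained image model is reproduced exactly when video finetuning starts. This is an illustrative sketch of the general technique, not the authors' exact architecture.
```python
# Illustrative sketch (not the paper's exact architecture): a temporal
# self-attention layer of the kind interleaved with the spatial layers
# of a 2D image UNet to turn it into a video model.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention across the frame axis, applied per spatial location.

    `channels` must be divisible by `num_heads`.
    """
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Zero-init the output projection: with the residual connection below,
        # the block is an identity map at the start of video finetuning, so
        # the pretrained image model's behavior is preserved initially.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width), the layout a 2D UNet uses.
        bf, c, h, w = x.shape
        b = bf // num_frames
        # Fold space into the batch axis and expose time as the sequence axis.
        t = x.reshape(b, num_frames, c, h * w).permute(0, 3, 1, 2)
        t = t.reshape(b * h * w, num_frames, c)
        res = t
        t = self.norm(t)
        t, _ = self.attn(t, t, t)
        t = t + res
        # Restore the (batch * frames, channels, height, width) layout.
        t = t.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1)
        return t.reshape(bf, c, h, w)
```
In a video UNet built this way, each spatial block's output is passed through such a temporal layer with the current number of frames, e.g. `x = temporal_attn(spatial_block(x), num_frames=T)`.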
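The curation process mentioned in the abstract includes filtering strategies; one of them scores clips by optical flow to remove near-static segments. The sketch below illustrates only that filtering idea: it computes a mean optical-flow magnitude per clip with OpenCV and keeps clips inside a motion band. The thresholds and the `keep_clip` helper are illustrative assumptions, not the paper's exact procedure.
```python
# A minimal sketch of motion-based clip filtering during data curation:
# discard near-static clips via an optical-flow motion score. Thresholds
# are illustrative assumptions, not the paper's values.
import cv2
import numpy as np

def motion_score(frames: list) -> float:
    """Mean optical-flow magnitude between consecutive grayscale uint8 frames."""
    mags = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # flow has shape (H, W, 2); take the per-pixel magnitude, then average.
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags))

def keep_clip(frames, min_motion=0.5, max_motion=20.0) -> bool:
    # Reject static clips, and clips whose apparent motion is dominated by
    # residual scene cuts or heavy camera shake (cut detection runs upstream).
    score = motion_score(frames)
    return min_motion <= score <= max_motion
```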
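To try the released image-to-video model, a minimal usage example via the Hugging Face diffusers port is sketched below (assuming diffusers 0.24 or later and the stabilityai/stable-video-diffusion-img2vid-xt checkpoint; resolution and frame rate follow that checkpoint's documented defaults).
```python
# Image-to-video generation with the released weights via diffusers.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

image = load_image("input.jpg").resize((1024, 576))  # conditioning frame
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
Here `decode_chunk_size` trades VRAM for speed when the VAE decodes the generated latent frames; lower it if decoding runs out of memory.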
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free, plug-and-play interpolation method for generative video diffusion models.
We transform a video model into a self-cascaded video diffusion model by adding designed hidden-state correction modules.
Our training-free method is comparable even to trained models backed by large compute budgets and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
- Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data [80.92268916571712]
A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions.
We propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images.
We have generated 1 million high-quality synthetic multi-view images with dense descriptive captions.
arXiv Detail & Related papers (2024-05-31T17:59:56Z)
- VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model [34.35449902855767]
Two fundamental questions are what data to use for training and how to ensure multi-view consistency.
We propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models.
Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-18T17:48:15Z)
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models [76.85329896854189]
We investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model.
We shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model.
arXiv Detail & Related papers (2024-01-17T08:30:32Z)
- Photorealistic Video Generation with Diffusion Models [44.95407324724976]
W.A.L.T. is a transformer-based approach for video generation via diffusion modeling.
We use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities.
We also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
arXiv Detail & Related papers (2023-12-11T18:59:57Z)
- DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose DreamVideo, a high-fidelity image-to-video generation method that devises a frame-retention branch on top of a pre-trained video diffusion model.
Our model has a powerful image-retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also exhibiting a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)