LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion
Models
- URL: http://arxiv.org/abs/2309.15103v2
- Date: Wed, 27 Sep 2023 03:51:52 GMT
- Title: LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion
Models
- Authors: Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi
Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing
Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua
Lin, Yu Qiao, Ziwei Liu
- Abstract summary: We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
- Score: 133.088893990272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work aims to learn a high-quality text-to-video (T2V) generative model
by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a
highly desirable yet challenging task to simultaneously a) accomplish the
synthesis of visually realistic and temporally coherent videos while b)
preserving the strong creative generation nature of the pre-trained T2I model.
To this end, we propose LaVie, an integrated video generation framework that
operates on cascaded video latent diffusion models, comprising a base T2V
model, a temporal interpolation model, and a video super-resolution model. Our
key insights are two-fold: 1) We reveal that the incorporation of simple
temporal self-attentions, coupled with rotary positional encoding, adequately
captures the temporal correlations inherent in video data. 2) Additionally, we
validate that the process of joint image-video fine-tuning plays a pivotal role
in producing high-quality and creative outcomes. To enhance the performance of
LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M,
consisting of 25 million text-video pairs that prioritize quality, diversity,
and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves
state-of-the-art performance both quantitatively and qualitatively.
Furthermore, we showcase the versatility of pre-trained LaVie models in various
long video generation and personalized video synthesis applications.
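The cascade above decomposes generation into three sequential models. The following is a minimal sketch of that data flow under illustrative assumptions; the stage interfaces, tensor sizes, and the naive stand-ins for the learned interpolation and super-resolution models are placeholders, not LaVie's implementation.

```python
# Hypothetical three-stage cascade: base T2V -> temporal interpolation ->
# video super-resolution. All shapes and stand-in operations are assumptions.
import torch
import torch.nn.functional as F

def base_t2v(prompt: str) -> torch.Tensor:
    """Stage 1: text -> short low-resolution clip (frames, c, h, w)."""
    return torch.randn(16, 3, 40, 64)  # placeholder for a diffusion sampler

def temporal_interpolation(video: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Stage 2: raise the frame rate (naive frame repeat stands in for the
    learned interpolation diffusion model)."""
    return video.repeat_interleave(factor, dim=0)

def video_super_resolution(video: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """Stage 3: raise the spatial resolution (bilinear upsampling stands in
    for the learned video super-resolution model)."""
    return F.interpolate(video, scale_factor=scale, mode="bilinear")

clip = video_super_resolution(temporal_interpolation(base_t2v("a corgi surfing")))
print(clip.shape)  # torch.Size([64, 3, 160, 256])
```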
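The first insight, temporal self-attention with rotary positional encoding (RoPE), can be sketched in isolation. This is an assumed reconstruction, not the paper's code: the tensor layout, head count, and RoPE base frequency (10000) are conventional choices; the point is that each spatial location attends only across frames, with frame order injected by rotating query/key pairs.

```python
# Sketch of temporal self-attention with RoPE over the frame axis (assumed
# reconstruction; layout and hyperparameters are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x: torch.Tensor) -> torch.Tensor:
    """Rotary positional encoding over the frame axis.
    x: (batch, heads, frames, head_dim), head_dim even."""
    _, _, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, device=x.device) / half))
    angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (frames, half)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a frame-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class TemporalSelfAttention(nn.Module):
    """Each spatial location attends over time only (space folded into batch)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, s, d = x.shape                        # (batch, frames, h*w, dim)
        x = x.transpose(1, 2).reshape(b * s, t, d)  # fold space into batch
        q, k, v = (y.reshape(b * s, t, self.heads, self.head_dim).transpose(1, 2)
                   for y in self.qkv(x).chunk(3, dim=-1))
        q, k = rope(q), rope(k)                     # positions enter via rotation
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b * s, t, d)
        return self.proj(out).reshape(b, s, t, d).transpose(1, 2)

out = TemporalSelfAttention(dim=64)(torch.randn(1, 16, 32 * 32, 64))
print(out.shape)  # torch.Size([1, 16, 1024, 64])
```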
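The second insight, joint image-video fine-tuning, amounts to optimizing image and video batches under a single denoising objective. A hedged sketch follows, assuming images are treated as single-frame videos and using a deliberately simplified noising step; `model` and `loss_fn` are placeholders, not LaVie's training code.

```python
# Sketch of joint image-video fine-tuning (assumptions throughout): image
# batches become 1-frame videos so both sources share one denoising loss.
import torch

def to_video_batch(batch: torch.Tensor) -> torch.Tensor:
    """Images (b, c, h, w) -> 1-frame videos (b, 1, c, h, w)."""
    return batch.unsqueeze(1) if batch.dim() == 4 else batch

def joint_step(model, image_batch, video_batch, loss_fn):
    """One step mixing both data sources. The image branch helps preserve
    the T2I prior's creativity; the video branch supplies temporal dynamics.
    Plain additive noise stands in for the real forward process q(x_t | x_0)."""
    loss = 0.0
    for batch in (to_video_batch(image_batch), video_batch):
        noise = torch.randn_like(batch)
        t = torch.randint(0, 1000, (batch.shape[0],))
        loss = loss + loss_fn(model(batch + noise, t), noise)
    return loss

# Toy usage: a stand-in "model" that predicts zeros, MSE loss.
demo = joint_step(lambda x, t: torch.zeros_like(x),
                  torch.randn(2, 3, 16, 16),      # images
                  torch.randn(2, 8, 3, 16, 16),   # 8-frame videos
                  torch.nn.functional.mse_loss)
```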
Related papers
- T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design [79.7289790249621]
Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals.
We highlight the crucial importance of tailoring datasets to specific learning objectives.
We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver.
arXiv Detail & Related papers (2024-10-08T04:30:06Z)
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method yields significant improvements in temporal consistency and image fidelity (the sample-interpolation idea is sketched after this entry).
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
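A minimal sketch of the sample-interpolation idea, under assumptions: `sampler` and `guide` stand in for the sampling and guiding (teacher) diffusion models, and the linear blend with weight `alpha` is an illustrative choice rather than VideoGuide's exact schedule.

```python
# Hypothetical guided denoising step: blend the teacher's denoised estimate
# into the student's (weight and step selection are assumptions).
import torch

def guided_denoise_step(x_t, t, sampler, guide, alpha: float = 0.3):
    """sampler/guide: callables (x_t, t) -> denoised estimate x0_hat."""
    x0_student = sampler(x_t, t)
    x0_teacher = guide(x_t, t)
    # Interpolating denoised samples nudges the sampling trajectory toward
    # the teacher's temporally consistent estimates.
    return (1 - alpha) * x0_student + alpha * x0_teacher

# Toy usage with stand-in models on a (batch, frames, c, h, w) latent:
x0 = guided_denoise_step(torch.randn(1, 8, 4, 32, 32), 500,
                         lambda x, t: 0.9 * x, lambda x, t: 0.8 * x)
```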
- JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation [6.463753697299011]
We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality, temporally coherent videos.
Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
arXiv Detail & Related papers (2024-09-21T13:59:50Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens (see the token-count sketch after this entry).
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
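To make the token-reduction claim concrete, a back-of-the-envelope sketch: the compression factors (8x spatial, 4x temporal, 2x2 patchification) are common video-VAE/DiT choices assumed here for illustration; xGen-VideoSyn-1's exact factors are not given in this summary.

```python
# Hypothetical token-count arithmetic for a spatio-temporally compressed
# video latent fed to a DiT (all factors are illustrative assumptions).
def num_visual_tokens(frames: int, height: int, width: int,
                      t_down: int = 4, s_down: int = 8, patch: int = 2) -> int:
    lat_t = frames // t_down                           # temporal compression
    lat_h, lat_w = height // s_down, width // s_down   # spatial compression
    return lat_t * (lat_h // patch) * (lat_w // patch) # DiT patch tokens

# 2 seconds of 24 fps 720p video:
print(num_visual_tokens(48, 720, 1280))  # 12 * 45 * 80 = 43200 tokens
```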
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides, benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors, semantics and visual quality.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z)
- Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
arXiv Detail & Related papers (2023-02-15T14:22:34Z)