Latent Video Diffusion Models for High-Fidelity Long Video Generation
- URL: http://arxiv.org/abs/2211.13221v2
- Date: Mon, 20 Mar 2023 17:29:45 GMT
- Title: Latent Video Diffusion Models for High-Fidelity Long Video Generation
- Authors: Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen
- Abstract summary: We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
- Score: 58.346702410885236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI-generated content has attracted considerable attention
recently, but photo-realistic video synthesis remains challenging. Although
many attempts
using GANs and autoregressive models have been made in this area, the visual
quality and length of generated videos are far from satisfactory. Diffusion
models have shown remarkable results recently but require significant
computational resources. To address this, we introduce lightweight video
diffusion models by leveraging a low-dimensional 3D latent space, significantly
outperforming previous pixel-space video diffusion models under a limited
computational budget. In addition, we propose hierarchical diffusion in the
latent space such that longer videos with more than one thousand frames can be
produced. To further overcome the performance degradation issue for long video
generation, we propose conditional latent perturbation and unconditional
guidance that effectively mitigate the accumulated errors during the extension
of video length. Extensive experiments on small domain datasets of different
categories suggest that our framework generates more realistic and longer
videos than previous strong baselines. We additionally provide an extension to
large-scale text-to-video generation to demonstrate the superiority of our
work. Our code and models will be made publicly available.
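To make the method concrete, here is a minimal PyTorch sketch of sampling in the low-dimensional 3D latent space, with the unconditional-guidance mixing and (in a closing comment) the conditional latent perturbation mentioned above. Every name and hyperparameter below (denoiser, decoder, the schedule, the latent shape) is an illustrative assumption, not the authors' released code.

```python
# Minimal sketch of sampling a video with a latent video diffusion
# model. Names and hyperparameters are illustrative assumptions.
import torch

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 2e-2, T)      # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_latent(denoiser, shape, cond=None, guidance_scale=1.0):
    """DDPM-style ancestral sampling in a low-dimensional 3D latent
    space; `shape` is (batch, channels, frames, height, width), far
    smaller than pixel space, which is what keeps the model light.
    `cond` is an optional conditioning latent (e.g., the tail of the
    previously generated clip when extending a long video)."""
    z = torch.randn(shape)
    for t in reversed(range(T)):
        ts = torch.full((shape[0],), t, dtype=torch.long)
        eps_uncond = denoiser(z, ts, cond=None)
        if cond is None:
            eps = eps_uncond
        else:
            # Unconditional guidance: blend conditional and
            # unconditional predictions to curb accumulated errors.
            eps_cond = denoiser(z, ts, cond=cond)
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        # Posterior mean of the reverse step (simplified variance).
        z = (z - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z

# Conditional latent perturbation (training-time idea, assumed form):
# lightly noise the conditioning latent so the model is robust to the
# imperfect latents it sees when autoregressively extending a video:
#   cond = ab.sqrt() * cond + (1 - ab).sqrt() * torch.randn_like(cond)
# where ab = alpha_bars[s] for a small perturbation step s. A
# (hypothetical) 3D decoder then maps latents back to pixels:
#   video = decoder(sample_latent(denoiser, (1, 4, 16, 32, 32)))
```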
Related papers
- ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning [36.378348127629195]
We propose a novel post-tuning methodology for video synthesis models, called ExVideo.
This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations.
Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos; a generic sketch of such post-tuning follows below.
arXiv Detail & Related papers (2024-06-20T09:18:54Z)
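Parameter-efficient post-tuning of this kind is commonly implemented by freezing the pretrained backbone and training only a small subset of parameters, often the temporal layers of a video model. A minimal sketch of that generic pattern, assuming a name-based filter; this is not ExVideo's actual recipe:

```python
import torch

def freeze_for_post_tuning(model: torch.nn.Module,
                           trainable_key: str = "temporal") -> int:
    """Freeze every parameter except those whose names contain
    `trainable_key` (assumed here to mark temporal layers), the usual
    parameter-efficient post-tuning setup. Returns the number of
    parameters left trainable."""
    n_trainable = 0
    for name, p in model.named_parameters():
        p.requires_grad = trainable_key in name
        if p.requires_grad:
            n_trainable += p.numel()
    return n_trainable

# Only the unfrozen subset is handed to the optimizer:
# opt = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```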
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames and therefore cannot generate high-fidelity long videos at inference time.
This study explores extending text-driven generation to longer videos conditioned on multiple text prompts.
We propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models (sketched below).
arXiv Detail & Related papers (2023-10-23T17:59:58Z)
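FreeNoise's noise rescheduling can be pictured as reusing a training-length clip's per-frame initial noise over a longer timeline, e.g., tiling it and locally shuffling the repeated frames so every sliding window keeps training-like noise statistics. A hypothetical sketch under that assumption; the paper's exact scheme may differ:

```python
import torch

def reschedule_noise(base_noise: torch.Tensor, target_frames: int,
                     window: int = 4) -> torch.Tensor:
    """Extend per-frame initial noise beyond the trained clip length.

    base_noise: (frames, C, H, W) initial noise for a clip of the
    length the model was trained on. The extension tiles this noise
    and locally shuffles each small window of the repeated part, so
    every training-length sliding window still sees training-like
    noise statistics. This is an assumed form of the rescheduling.
    """
    frames = base_noise.shape[0]
    reps = -(-target_frames // frames)            # ceiling division
    tiled = base_noise.repeat(reps, 1, 1, 1)[:target_frames]
    out = tiled.clone()
    for start in range(frames, target_frames, window):
        end = min(start + window, target_frames)
        perm = torch.randperm(end - start) + start
        out[start:end] = tiled[perm]
    return out

# e.g., extend 16-frame noise to 64 frames:
# long_noise = reschedule_noise(torch.randn(16, 4, 32, 32), 64)
```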
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate clips that are short relative to the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
- Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion model (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
arXiv Detail & Related papers (2023-02-15T14:22:34Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also of having a high degree of controllability and world knowledge (the cascade wiring is sketched below).
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
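The cascade mentioned above chains a text-conditional low-resolution base model with successive spatial and temporal super-resolution diffusion stages, each conditioned on the previous stage's output. A minimal sketch of that wiring, with all stage interfaces assumed for illustration:

```python
from typing import Callable, List, Optional
import torch

# A stage maps (previous-stage video or None, text embedding) to a
# video at its own resolution / frame count; all interfaces assumed.
Stage = Callable[[Optional[torch.Tensor], torch.Tensor], torch.Tensor]

def run_cascade(stages: List[Stage], text_emb: torch.Tensor) -> torch.Tensor:
    """Run a text-conditional cascade: a low-resolution base video
    model first, then alternating spatial and temporal super-resolution
    diffusion stages, each conditioned on (an upsampled version of)
    the previous stage's output and on the text embedding."""
    video: Optional[torch.Tensor] = None
    for stage in stages:
        video = stage(video, text_emb)
    return video
```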
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high-fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.