Video Probabilistic Diffusion Models in Projected Latent Space
- URL: http://arxiv.org/abs/2302.07685v2
- Date: Thu, 30 Mar 2023 07:08:21 GMT
- Title: Video Probabilistic Diffusion Models in Projected Latent Space
- Authors: Sihyun Yu, Kihyuk Sohn, Subin Kim, Jinwoo Shin
- Abstract summary: We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
- Score: 75.4253202574722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the remarkable progress in deep generative models, synthesizing
high-resolution and temporally coherent videos still remains a challenge due to
their high-dimensionality and complex temporal dynamics along with large
spatial variations. Recent works on diffusion models have shown their potential
to solve this challenge, yet they suffer from severe computation- and
memory-inefficiency that limit the scalability. To handle this issue, we
propose a novel generative model for videos, coined projected latent video
diffusion models (PVDM), a probabilistic diffusion model which learns a video
distribution in a low-dimensional latent space and thus can be efficiently
trained with high-resolution videos under limited resources. Specifically, PVDM
is composed of two components: (a) an autoencoder that projects a given video
as 2D-shaped latent vectors that factorize the complex cubic structure of video
pixels and (b) a diffusion model architecture specialized for our new
factorized latent space and the training/sampling procedure to synthesize
videos of arbitrary length with a single model. Experiments on popular video
generation datasets demonstrate the superiority of PVDM compared with previous
video synthesis methods; e.g., PVDM obtains an FVD score of 639.7 on the
UCF-101 long video (128 frames) generation benchmark, improving on the prior
state-of-the-art score of 1773.4.
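To make the "2D-shaped latent" idea in the abstract concrete, here is a minimal shape-level sketch in Python. It is not the PVDM autoencoder: the abstract does not specify how the projection is computed, so mean pooling along one axis stands in for the learned projection, and the function names, the `(T, H, W, C)` layout, and the sizes are all assumptions chosen for illustration. The only point is to show how three 2D planes hold far fewer positions than a cubic latent of the same extent.

```python
import numpy as np


def project_to_planes(latent):
    """Collapse a (T, H, W, C) array onto three 2D planes.

    ASSUMPTION: mean pooling is a placeholder for a learned projection;
    it only illustrates the output shapes, not the actual model.
    """
    z_hw = latent.mean(axis=0)  # (H, W, C): pooled over time
    z_tw = latent.mean(axis=1)  # (T, W, C): pooled over height
    z_th = latent.mean(axis=2)  # (T, H, C): pooled over width
    return z_hw, z_tw, z_th


def count_positions(t, h, w):
    """Return (cubic, factorized) position counts for a T x H x W grid."""
    cubic = t * h * w
    factorized = h * w + t * w + t * h
    return cubic, factorized


# Hypothetical sizes: 16 frames, 32 x 32 spatial grid, 4 channels.
rng = np.random.default_rng(0)
latent = rng.random((16, 32, 32, 4))

shapes = [plane.shape for plane in project_to_planes(latent)]
cubic, factorized = count_positions(16, 32, 32)

print(shapes)             # [(32, 32, 4), (16, 32, 4), (16, 32, 4)]
print(cubic, factorized)  # 16384 2048
```

With these assumed sizes the three planes hold 2048 positions against 16384 for the full cube, an 8x reduction. The count grows with the sum of pairwise products of the dimensions rather than with their triple product, which is the kind of saving a factorized latent space is meant to give a diffusion model operating on it.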
Related papers
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
arXiv Detail & Related papers (2024-10-27T16:28:28Z)
- Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach [29.753974393652356]
We propose a frame-aware video diffusion model (FVDM).
Our approach allows each frame to follow an independent noise schedule, enhancing the model's capacity to capture fine-grained temporal dependencies.
Our empirical evaluations show that FVDM outperforms state-of-the-art methods in video generation quality, while also excelling in extended tasks.
arXiv Detail & Related papers (2024-10-04T05:47:39Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning [36.378348127629195]
We propose a novel post-tuning methodology for video synthesis models, called ExVideo.
This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations.
Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos.
arXiv Detail & Related papers (2024-06-20T09:18:54Z)
- GD-VDM: Generated Depth for better Diffusion-based Video Generation [18.039417502897486]
This paper proposes GD-VDM, a novel diffusion model for video generation, demonstrating promising results.
We evaluated GD-VDM on the Cityscapes dataset and found that it generates more diverse and complex scenes compared to natural baselines.
arXiv Detail & Related papers (2023-06-19T21:32:10Z)
- Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [71.11425812806431]
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands.
Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task.
We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
arXiv Detail & Related papers (2023-04-18T08:30:32Z)
- Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)