Related papers: Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

URL: http://arxiv.org/abs/2403.14148v1
Date: Thu, 21 Mar 2024 05:48:48 GMT
Title: Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition
Authors: Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, Anima Anandkumar
Abstract summary: We propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation. CMD encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model.
Score: 124.41196697408627
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video diffusion models have recently made great progress in generation quality, but are still limited by the high memory and computational requirements. This is because current video diffusion models often attempt to process high-dimensional videos directly. To tackle this issue, we propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation. Specifically, we propose an autoencoder that succinctly encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. The former represents the common content, and the latter represents the underlying motion in the video, respectively. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model. A key innovation here is the design of a compact latent space that can directly utilize a pretrained image diffusion model, which has not been done in previous latent video diffusion models. This leads to considerably better quality generation and reduced computational costs. For instance, CMD can sample a video 7.7$\times$ faster than prior approaches by generating a video of 512$\times$1024 resolution and length 16 in 3.1 seconds. Moreover, CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4.
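The headline numbers in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_improvement` is hypothetical, and the "implied prior sampling time" is derived here on the assumption that the 7.7$\times$ speedup is measured against the 3.1-second figure; the abstract does not state that baseline time directly.

```python
# Sanity check of the figures quoted in the abstract (illustrative sketch).

def relative_improvement(baseline: float, new: float) -> float:
    """Percentage reduction of a lower-is-better metric relative to the baseline."""
    return (baseline - new) / baseline * 100.0

# FVD scores on WebVid-10M as quoted in the abstract (lower is better).
fvd_prior, fvd_cmd = 292.4, 212.7
fvd_gain = relative_improvement(fvd_prior, fvd_cmd)

# Sampling time and speedup as quoted; the prior-approach time is inferred
# under the assumption stated above, not reported in the abstract.
cmd_seconds, speedup = 3.1, 7.7
implied_prior_seconds = cmd_seconds * speedup

print(f"FVD improvement: {fvd_gain:.1f}%")                             # 27.3%
print(f"Implied prior sampling time: {implied_prior_seconds:.2f} s")   # 23.87 s
```

The computed 27.3% agrees with the relative FVD improvement stated in the abstract.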

Related papers

TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation [4.261090951843438]
Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ based on two consecutive neighboring frames. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. We propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model.
arXiv Detail & Related papers (2025-07-07T13:25:32Z)
REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain much more redundant information than images, thus can be encoded by very few motion latents. We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024*1024 video clip within 15.5 seconds on a single A100 GPU.
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
Photorealistic Video Generation with Diffusion Models [44.95407324724976]
W.A.L.T. is a transformer-based approach for video generation via diffusion modeling. We use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. We also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of $512 \times$ resolution at $8$ frames per second.
arXiv Detail & Related papers (2023-12-11T18:59:57Z)
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [71.11425812806431]
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands. Here, we apply the LDM paradigm to high-resolution generation, a particularly resource-intensive task. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
arXiv Detail & Related papers (2023-04-18T08:30:32Z)
Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space. We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)
MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We propose a diffusion model for video generation that shows very promising initial results. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
