VideoLCM: Video Latent Consistency Model
- URL: http://arxiv.org/abs/2312.09109v1
- Date: Thu, 14 Dec 2023 16:45:36 GMT
- Title: VideoLCM: Video Latent Consistency Model
- Authors: Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin
Gao, Nong Sang
- Abstract summary: VideoLCM builds upon existing latent video diffusion models and incorporates consistency distillation techniques for training the latent consistency model.
VideoLCM achieves high-fidelity and smooth video synthesis with only four sampling steps, showcasing the potential for real-time synthesis.
- Score: 52.3311704118393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consistency models have demonstrated a powerful capability for
efficient image generation, allowing synthesis within a few sampling steps and
alleviating the high computational cost of diffusion models. However,
consistency models for the more challenging and resource-consuming task of
video generation remain less explored. In this report, we present the VideoLCM
framework to fill this gap,
which leverages the concept of consistency models from image generation to
efficiently synthesize videos with minimal steps while maintaining high
quality. VideoLCM builds upon existing latent video diffusion models and
incorporates consistency distillation techniques for training the latent
consistency model. Experimental results reveal the effectiveness of our
VideoLCM in terms of computational efficiency, fidelity and temporal
consistency. Notably, VideoLCM achieves high-fidelity and smooth video
synthesis with only four sampling steps, showcasing the potential for real-time
synthesis. We hope that VideoLCM can serve as a simple yet effective baseline
for subsequent research. The source code and models will be publicly available.
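To make the few-step generation described in the abstract concrete, the sketch below illustrates how a distilled consistency model can produce a video latent in four network calls, in contrast to the tens of iterations a standard diffusion sampler needs. It is a minimal, hypothetical illustration in plain NumPy, not the authors' released code: the function names, timestep values, and the simplified re-noising rule are assumptions standing in for VideoLCM's actual noise schedule and architecture.

    import numpy as np

    def consistency_sample(f_theta, shape, timesteps=(999, 749, 499, 249),
                           sigma=1.0, seed=0):
        """Draw a clean video latent in len(timesteps) network calls (four here,
        mirroring the four sampling steps reported for VideoLCM).

        f_theta(x_t, t) -> x0_hat is the consistency function: it maps a noisy
        latent and a timestep directly to an estimate of the clean latent.
        shape is the latent tensor shape, e.g. (frames, channels, height, width).
        """
        rng = np.random.default_rng(seed)
        x = sigma * rng.standard_normal(shape)        # start from pure noise
        for i, t in enumerate(timesteps):
            x0_hat = f_theta(x, t)                    # one network call per step
            if i + 1 < len(timesteps):
                # Re-noise the clean estimate down to the next noise level.
                # The linear scaling below is a simplification of a real schedule.
                t_next = timesteps[i + 1]
                x = x0_hat + (t_next / timesteps[0]) * sigma * rng.standard_normal(shape)
            else:
                x = x0_hat                            # final estimate: clean latent
        return x                                      # decode with the VAE to get frames

    # Toy usage with a dummy consistency function in place of a distilled network.
    dummy_f = lambda x, t: 0.5 * x
    latent = consistency_sample(dummy_f, shape=(16, 4, 32, 32))
    print(latent.shape)                               # (16, 4, 32, 32)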
Related papers
- JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation [6.463753697299011]
We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality temporally coherent videos.
Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
arXiv Detail & Related papers (2024-09-21T13:59:50Z)
- ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning [36.378348127629195]
We propose a novel post-tuning methodology for video synthesis models, called ExVideo.
This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations.
Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos.
arXiv Detail & Related papers (2024-06-20T09:18:54Z)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference [60.32804641276217]
We propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs.
A high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training.
We also introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets.
arXiv Detail & Related papers (2023-10-06T17:11:58Z)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z)
- Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
We show that MCDiff achieves state-of-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z)
- Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
arXiv Detail & Related papers (2023-02-15T14:22:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.