VideoCrafter2: Overcoming Data Limitations for High-Quality Video
Diffusion Models
- URL: http://arxiv.org/abs/2401.09047v1
- Date: Wed, 17 Jan 2024 08:30:32 GMT
- Title: VideoCrafter2: Overcoming Data Limitations for High-Quality Video
Diffusion Models
- Authors: Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao
Weng, Ying Shan
- Abstract summary: We investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model.
We shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model.
- Score: 76.85329896854189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-video generation aims to produce a video based on a given prompt.
Recently, several commercial video models have been able to generate plausible
videos with minimal noise, excellent details, and high aesthetic scores.
However, these models rely on large-scale, well-filtered, high-quality videos
that are not accessible to the community. Many existing research works, which
train models using the low-quality WebVid-10M dataset, struggle to generate
high-quality videos because the models are optimized to fit WebVid-10M. In this
work, we explore the training scheme of video models extended from Stable
Diffusion and investigate the feasibility of leveraging low-quality videos and
synthesized high-quality images to obtain a high-quality video model. We first
analyze the connection between the spatial and temporal modules of video models
and the distribution shift to low-quality videos. We observe that full training
of all modules results in a stronger coupling between spatial and temporal
modules than training only the temporal modules. Based on this stronger coupling,
we shift the distribution to higher quality without motion degradation by
finetuning spatial modules with high-quality images, resulting in a generic
high-quality video model. Evaluations are conducted to demonstrate the
superiority of the proposed method, particularly in picture quality, motion,
and concept composition.
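To make the training scheme above concrete, the following is a minimal PyTorch sketch. It is not the authors' released code: the toy block, module names, and shapes are hypothetical placeholders used only to illustrate stage 1 (full training of all spatial and temporal modules on low-quality videos, which couples the two) and stage 2 (freezing the temporal modules and finetuning only the spatial modules on high-quality synthesized images).

# Minimal sketch (not the authors' code) of the two-stage scheme described in
# the abstract: (1) fully train spatial + temporal modules on low-quality
# videos to obtain strong spatial-temporal coupling, then (2) finetune only
# the spatial modules on high-quality synthesized images.
import torch
import torch.nn as nn

class ToyVideoDiffusionBlock(nn.Module):
    """Stand-in for one UNet block with a spatial and a temporal sub-module."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Spatial module: operates per-frame (e.g., 2D conv / spatial attention).
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Temporal module: operates across frames (e.g., temporal attention).
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        # Spatial pass: fold frames into the batch dimension.
        x = self.spatial(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        # Temporal pass: fold spatial positions into the batch, mix over frames.
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temporal(x).reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)
        return x

def set_stage(model: nn.Module, stage: str) -> None:
    """Stage 1: train everything on low-quality videos.
    Stage 2: freeze temporal modules, finetune spatial modules on HQ images."""
    for name, param in model.named_parameters():
        if stage == "full_video_training":
            param.requires_grad = True
        elif stage == "spatial_image_finetuning":
            param.requires_grad = "spatial" in name
        else:
            raise ValueError(f"unknown stage: {stage}")

block = ToyVideoDiffusionBlock()
set_stage(block, "full_video_training")       # joint training on WebVid-scale low-quality videos
set_stage(block, "spatial_image_finetuning")  # HQ-image finetuning of spatial layers only
print([n for n, p in block.named_parameters() if p.requires_grad])

In models of this family the spatial modules are typically the 2D layers inherited from Stable Diffusion and the temporal modules the added cross-frame layers; only the requires_grad bookkeeping above is meant to carry over to a real implementation.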
Related papers
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free, plug-and-play method for generative video models.
We transform a video model into a self-cascaded video diffusion model with designed hidden-state correction modules.
Our training-free method is comparable even to trained models backed by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
- SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model [41.825824810180215]
This paper presents SNED, a superposition network architecture search method for efficient video diffusion models.
Our method employs a supernet training paradigm that targets various model cost and resolution options.
Our framework consistently produces comparable results across different model options with high efficiency.
arXiv Detail & Related papers (2024-05-31T21:12:30Z)
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions [94.03133100056372]
Moonshot is a new video generation model that conditions simultaneously on multimodal inputs of image and text.
The model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing.
arXiv Detail & Related papers (2024-01-03T16:43:47Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)