VideoElevator: Elevating Video Generation Quality with Versatile
Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2403.05438v1
- Date: Fri, 8 Mar 2024 16:44:54 GMT
- Title: VideoElevator: Elevating Video Generation Quality with Versatile
Text-to-Image Diffusion Models
- Authors: Yabo Zhang, Yuxiang Wei, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong
Xie, Xiangyang Ji, Wangmeng Zuo
- Abstract summary: Text-to-video diffusion models (T2V) still lag far behind text-to-image diffusion models (T2I) in frame quality and text alignment.
We introduce VideoElevator, a training-free and plug-and-play method that elevates the performance of T2V using the superior capabilities of T2I.
- Score: 94.25084162939488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion models (T2I) have demonstrated unprecedented
capabilities in creating realistic and aesthetic images. In contrast,
text-to-video diffusion models (T2V) still lag far behind in frame quality and
text alignment, owing to insufficient quality and quantity of training videos.
In this paper, we introduce VideoElevator, a training-free and plug-and-play
method, which elevates the performance of T2V using superior capabilities of
T2I. Different from conventional T2V sampling (i.e., temporal and spatial
modeling), VideoElevator explicitly decomposes each sampling step into temporal
motion refining and spatial quality elevating. Specifically, temporal motion
refining uses encapsulated T2V to enhance temporal consistency, followed by
inverting to the noise distribution required by T2I. Then, spatial quality
elevating harnesses inflated T2I to directly predict a less noisy latent, adding
more photo-realistic details. We have conducted experiments on extensive
prompts under combinations of various T2V and T2I models. The results show that
VideoElevator not only improves the performance of T2V baselines with
foundational T2I, but also facilitates stylistic video synthesis with
personalized T2I. Our code is available at
https://github.com/YBYBZhang/VideoElevator.
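The two-stage sampling step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name below is hypothetical, neighbour-frame blending stands in for the T2V model, re-noising with fresh Gaussian noise is a simplification of the inversion step, and the per-frame denoiser is the posterior mean under a unit-Gaussian prior rather than a real T2I model. Only the ordering of the stages follows the abstract.

```python
import math
import random


def temporal_motion_refine(frames, strength=0.5):
    """Hypothetical stand-in for the T2V stage: blend each frame with its neighbours."""
    n = len(frames)
    refined = []
    for i, frame in enumerate(frames):
        prev_f = frames[max(i - 1, 0)]      # first frame reuses itself as "previous"
        next_f = frames[min(i + 1, n - 1)]  # last frame reuses itself as "next"
        refined.append([
            (1.0 - strength) * x + strength * 0.5 * (p + q)
            for x, p, q in zip(frame, prev_f, next_f)
        ])
    return refined


def invert_to_noise_level(frames, alpha_bar, rng):
    """Simplified stand-in for inversion: re-noise to the level the image model expects."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * x + b * rng.gauss(0.0, 1.0) for x in frame] for frame in frames]


def spatial_quality_elevate(frames, alpha_bar):
    """Hypothetical stand-in for the per-frame T2I stage.

    Returns sqrt(alpha_bar) * z, the posterior mean of the clean latent
    when the clean latent is assumed to follow a unit Gaussian.
    """
    a = math.sqrt(alpha_bar)
    return [[a * z for z in frame] for frame in frames]


def toy_decomposed_sampling(frames, alpha_bars, seed=0):
    """Run the two stages once per step; alpha_bars should increase (less noise)."""
    rng = random.Random(seed)
    for alpha_bar in alpha_bars:
        frames = temporal_motion_refine(frames)                 # stage 1: temporal
        frames = invert_to_noise_level(frames, alpha_bar, rng)  # bridge to image model
        frames = spatial_quality_elevate(frames, alpha_bar)     # stage 2: spatial
    return frames


def mean_frame_difference(frames):
    """Average absolute difference between consecutive frames."""
    diffs = [abs(x - y) for f, g in zip(frames, frames[1:]) for x, y in zip(f, g)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    rng = random.Random(42)
    # 8 frames, each a flattened 16-value "latent"
    video = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
    out = toy_decomposed_sampling(video, alpha_bars=[0.2, 0.5, 0.8, 0.99])
    print(len(out), len(out[0]))
```

In a real pipeline the three stand-ins would presumably be replaced by calls into pretrained T2V and T2I denoisers sharing a compatible latent space; the released code at the URL above is the authoritative reference for how the method actually does this.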
Related papers
- Still-Moving: Customized Video Generation without Customized Video Data [81.09302547183155]
We introduce Still-Moving, a novel framework for customizing a text-to-video (T2V) model.
The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model.
We train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers.
arXiv Detail & Related papers (2024-07-11T17:06:53Z)
- Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers [69.96398489841116]
We introduce the Lumina-T2X family of Flow-based Large Diffusion Transformers (Flag-DiT).
Flag-DiT is a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions.
This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model.
arXiv Detail & Related papers (2024-05-09T17:35:16Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation [102.90154301044095]
We propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
In addition to T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning.
arXiv Detail & Related papers (2023-08-18T17:58:44Z)
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive datasets.
We propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)