ControlVideo: Training-free Controllable Text-to-Video Generation
- URL: http://arxiv.org/abs/2305.13077v1
- Date: Mon, 22 May 2023 14:48:53 GMT
- Title: ControlVideo: Training-free Controllable Text-to-Video Generation
- Authors: Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng
Zuo, Qi Tian
- Abstract summary: ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
- Score: 117.06302461557044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-driven diffusion models have unlocked unprecedented abilities in image
generation, whereas their video counterpart still lags behind due to the
excessive training cost of temporal modeling. Besides the training burden, the
generated videos also suffer from appearance inconsistency and structural
flickers, especially in long video synthesis. To address these challenges, we
design a \emph{training-free} framework called \textbf{ControlVideo} to enable
natural and efficient text-to-video generation. ControlVideo, adapted from
ControlNet, leverages coarsely structural consistency from input motion
sequences, and introduces three modules to improve video generation. Firstly,
to ensure appearance coherence between frames, ControlVideo adds fully
cross-frame interaction in self-attention modules. Secondly, to mitigate the
flicker effect, it introduces an interleaved-frame smoother that employs frame
interpolation on alternated frames. Finally, to produce long videos
efficiently, it utilizes a hierarchical sampler that separately synthesizes
each short clip with holistic coherency. Empowered with these modules,
ControlVideo outperforms the state-of-the-arts on extensive motion-prompt pairs
quantitatively and qualitatively. Notably, thanks to the efficient designs, it
generates both short and long videos within several minutes using one NVIDIA
2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.
Related papers
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z) - EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture [11.587428534308945]
EasyAnimate is an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes.
We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block.
We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training, and end-to-end video inference.
arXiv Detail & Related papers (2024-05-29T11:11:07Z) - StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions.
Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z) - LOVECon: Text-driven Training-Free Long Video Editing with ControlNet [9.762680144118061]
This paper aims to bridge the gap, establishing a simple and effective baseline for training-free diffusion model-based long video editing.
We build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts.
Our method manages to edit videos comprising hundreds of frames according to user requirements.
arXiv Detail & Related papers (2023-10-15T02:39:25Z) - ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z) - Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z) - ControlVideo: Conditional Control for One-shot Text-driven Video Editing
and Beyond [45.188722895165505]
ControlVideo generates a video that aligns with a given text while preserving the structure of the source video.
Building on a pre-trained text-to-image diffusion model, ControlVideo enhances the fidelity and temporal consistency.
arXiv Detail & Related papers (2023-05-26T17:13:55Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing video with arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs)
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.