Control-A-Video: Controllable Text-to-Video Generation with Diffusion
Models
- URL: http://arxiv.org/abs/2305.13840v2
- Date: Wed, 6 Dec 2023 14:03:00 GMT
- Title: Control-A-Video: Controllable Text-to-Video Generation with Diffusion
Models
- Authors: Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin
Xia, Xuefeng Xiao, Liang Lin
- Abstract summary: We present a controllable text-to-video (T2V) diffusion model, called Control-A-Video, capable of maintaining consistency while allowing customizable video synthesis.
To improve object consistency, Control-A-Video integrates motion priors and content priors into video generation.
Our model achieves resource-efficient convergence and generates consistent and coherent videos with fine-grained control.
- Score: 52.512109160994655
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in diffusion models have unlocked unprecedented abilities
in visual creation. However, current text-to-video generation models struggle
with the trade-off among movement range, action coherence and object
consistency. To mitigate this issue, we present a controllable text-to-video
(T2V) diffusion model, called Control-A-Video, capable of maintaining
consistency while allowing customizable video synthesis. Based on a pre-trained
conditional text-to-image (T2I) diffusion model, our model aims to generate
videos conditioned on a sequence of control signals, such as edge or depth
maps. To improve object consistency, Control-A-Video
integrates motion priors and content priors into video generation. We propose
two motion-adaptive noise initialization strategies, which are based on pixel
residual and optical flow, to introduce motion priors from input videos,
producing more coherent videos. Moreover, a first-frame conditioned controller
is proposed to generate videos from content priors of the first frame, which
facilitates the semantic alignment with text and allows longer video generation
in an auto-regressive manner. With the proposed architecture and strategies,
our model achieves resource-efficient convergence and generates consistent and
coherent videos with fine-grained control. Extensive experiments demonstrate
its success in various video generative tasks such as video editing and video
style transfer, outperforming previous methods in terms of consistency and
quality.
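The two key ideas in the abstract can be made concrete with short sketches. First, a minimal PyTorch illustration of a motion-adaptive noise initialization based on pixel residuals: shared noise is reused in regions where consecutive frames barely change, and fresh noise is drawn where motion is detected, so the initial latents carry a coarse motion prior from the input video. The function name, the (T, C, H, W) layout, and the `threshold` parameter are assumptions for illustration, not the paper's actual implementation, and the optical-flow variant is omitted.

```python
import torch

def residual_based_noise_init(frames: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """Sketch of residual-based motion-adaptive noise initialization.

    frames: (T, C, H, W) video tensor with values in [0, 1].
    Returns per-frame initial noise of the same shape.
    """
    num_frames = frames.shape[0]
    noise = [torch.randn_like(frames[0])]  # base noise shared by static regions
    for t in range(1, num_frames):
        # mean absolute pixel residual between consecutive frames, shape (1, H, W)
        residual = (frames[t] - frames[t - 1]).abs().mean(dim=0, keepdim=True)
        moving = (residual > threshold).float()  # 1 where the scene moved
        fresh = torch.randn_like(frames[t])
        # resample noise in moving regions, reuse the previous frame's noise elsewhere
        noise.append(moving * fresh + (1.0 - moving) * noise[-1])
    return torch.stack(noise)
```

Second, the first-frame conditioned controller suggests a simple auto-regressive loop for longer videos: generate one clip, then condition the next clip on the last frame generated so far. Here `pipe` and its `control`/`first_frame` arguments are hypothetical placeholders for the model's interface, not a documented API.

```python
def generate_long_video(pipe, prompt, control_maps, clip_len=8):
    """Sketch of auto-regressive long-video generation with a first-frame content prior."""
    frames, first_frame = [], None
    for start in range(0, len(control_maps), clip_len):
        clip = pipe(prompt=prompt,
                    control=control_maps[start:start + clip_len],
                    first_frame=first_frame)  # content prior taken from the previous clip
        frames.extend(clip)
        first_frame = frames[-1]  # last generated frame seeds the next clip
    return frames
```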
Related papers
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism based on Plücker coordinates.
Our work is the first to enable camera control for transformer-based video diffusion models.
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce a Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning [47.681633892135125]
We propose AnimateLCM, allowing for high-fidelity video generation within minimal steps.
Instead of directly conducting consistency learning on the raw video dataset, we propose a decoupled consistency learning strategy.
We validate the proposed strategy in image-conditioned video generation and layout-conditioned video generation, all achieving top-performing results.
arXiv Detail & Related papers (2024-02-01T16:58:11Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models [0.0]
We propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models.
Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames.
arXiv Detail & Related papers (2023-05-10T02:33:25Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)