Control-A-Video: Controllable Text-to-Video Generation with Diffusion
Models
- URL: http://arxiv.org/abs/2305.13840v2
- Date: Wed, 6 Dec 2023 14:03:00 GMT
- Title: Control-A-Video: Controllable Text-to-Video Generation with Diffusion
Models
- Authors: Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin
Xia, Xuefeng Xiao, Liang Lin
- Abstract summary: We present a controllable text-to-video (T2V) diffusion model, called Control-A-Video, capable of maintaining consistency while supporting customizable video synthesis.
For the purpose of improving object consistency, Control-A-Video integrates motion priors and content priors into video generation.
Our model achieves resource-efficient convergence and generates consistent and coherent videos with fine-grained control.
- Score: 52.512109160994655
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in diffusion models have unlocked unprecedented abilities
in visual creation. However, current text-to-video generation models struggle
with the trade-off among movement range, action coherence and object
consistency. To mitigate this issue, we present a controllable text-to-video
(T2V) diffusion model, called Control-A-Video, capable of maintaining
consistency while supporting customizable video synthesis. Based on a pre-trained
conditional text-to-image (T2I) diffusion model, our model aims to generate
videos conditioned on a sequence of control signals, such as edge or depth
maps. For the purpose of improving object consistency, Control-A-Video
integrates motion priors and content priors into video generation. We propose
two motion-adaptive noise initialization strategies, which are based on pixel
residual and optical flow, to introduce motion priors from input videos,
producing more coherent videos. Moreover, a first-frame conditioned controller
is proposed to generate videos from content priors of the first frame, which
facilitates the semantic alignment with text and allows longer video generation
in an auto-regressive manner. With the proposed architecture and strategies,
our model achieves resource-efficient convergence and generates consistent and
coherent videos with fine-grained control. Extensive experiments demonstrate
its success in various video generative tasks such as video editing and video
style transfer, outperforming previous methods in terms of consistency and
quality.
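The abstract's motion-adaptive noise initialization can be made concrete with a small sketch. The snippet below is a minimal, illustrative rendering of the pixel-residual variant, not the authors' released implementation: it assumes pixel-space frames in [0, 1] and a hand-picked residual threshold, and it simply lets static regions share noise across frames while moving regions receive fresh noise, which is how motion priors from the input video can bias the diffusion starting point.

```python
import torch

def residual_based_noise_init(frames: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """Sketch of motion-adaptive noise initialization (pixel-residual variant).

    frames:    (T, C, H, W) source-video frames in [0, 1] whose motion should
               be carried into the generated video.
    threshold: illustrative cutoff separating "static" from "moving" pixels
               (an assumption, not a value from the paper).

    Returns (T, C, H, W) Gaussian noise in which static regions reuse the
    previous frame's noise and moving regions receive independent noise.
    """
    num_frames = frames.shape[0]
    noise = torch.randn_like(frames)  # independent noise for every frame
    for t in range(1, num_frames):
        # Per-pixel residual between consecutive frames, averaged over channels.
        residual = (frames[t] - frames[t - 1]).abs().mean(dim=0, keepdim=True)  # (1, H, W)
        moving = (residual > threshold).float()
        # Static pixels inherit the previous frame's noise; moving pixels keep fresh noise.
        noise[t] = moving * noise[t] + (1.0 - moving) * noise[t - 1]
    return noise
```

The optical-flow variant mentioned in the abstract would instead warp the previous frame's noise along estimated flow; in both cases the correlated initialization is what nudges the denoising process toward temporally coherent frames.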
Related papers
- LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer [10.44905923812975]
We propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation.
Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos.
Our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability.
arXiv Detail & Related papers (2025-05-20T10:18:29Z)
- Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization [63.37161241355025]
Video-MSG is a training-free method for T2V generation based on Multimodal planning and Structured noise initialization.
It guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising (a generic sketch of this inversion step appears after this list).
Video-MSG does not need fine-tuning or attention manipulation with additional memory at inference time, making it easier to adopt large T2V models.
arXiv Detail & Related papers (2025-04-11T15:41:43Z)
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model [133.01510927611452]
We present Step-Video-T2V, a text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length.
A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios.
Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality.
arXiv Detail & Related papers (2025-02-14T15:58:10Z)
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method brings about significant improvement in temporal consistency and image fidelity.
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
- VideoTetris: Towards Compositional Text-to-Video Generation [45.395598467837374]
VideoTetris is a framework that enables compositional T2V generation.
We show that VideoTetris achieves impressive qualitative and quantitative results in T2V generation.
arXiv Detail & Related papers (2024-06-06T17:25:33Z)
- MoVideo: Motion-Aware Video Generation with Diffusion Models [97.03352319694795]
We propose a novel motion-aware generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow.
MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
arXiv Detail & Related papers (2023-11-19T13:36:03Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [33.37279673304]
We introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text.
ConditionVideo generates realistic dynamic videos from random noise or given scene videos.
Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.
arXiv Detail & Related papers (2023-10-11T17:46:28Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
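Several entries above lean on running a diffusion sampler in both directions, for example Video-MSG's structured noise initialization via noise inversion and VideoGuide's interpolation of a guiding model's denoised samples into the sampling process. The snippet below is a generic, self-contained sketch of deterministic DDIM inversion under assumed inputs (a noise-prediction callable and a cumulative-alpha schedule); it is not the exact pipeline of any of the listed papers.

```python
import torch

def ddim_invert(x0: torch.Tensor, eps_model, alphas_bar: torch.Tensor) -> torch.Tensor:
    """Generic deterministic DDIM inversion (illustrative sketch only).

    x0:         clean sample (e.g. a latent frame or video tensor).
    eps_model:  callable eps_model(x, t) returning predicted noise at step t
                (placeholder for any pretrained noise-prediction network).
    alphas_bar: 1-D cumulative-alpha schedule, decreasing from ~1 (clean)
                toward 0 (pure noise); its length is the number of timesteps.

    Walks the deterministic DDIM update backwards, mapping x0 to a latent
    that approximately denoises back to x0 -- the usual starting point for
    structured-noise or guidance-style methods.
    """
    x = x0
    num_steps = alphas_bar.shape[0]
    for t in range(num_steps - 1):
        a_cur, a_next = alphas_bar[t], alphas_bar[t + 1]
        eps = eps_model(x, t)
        # Estimate the clean sample at the current noise level...
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        # ...then re-noise it to the next (higher) noise level.
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x
```

Running the same update with the loop reversed recovers the ordinary DDIM denoising pass, which is the stage where guidance-style methods intervene.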