Sketching the Future (STF): Applying Conditional Control Techniques to
Text-to-Video Models
- URL: http://arxiv.org/abs/2305.05845v1
- Date: Wed, 10 May 2023 02:33:25 GMT
- Title: Sketching the Future (STF): Applying Conditional Control Techniques to
Text-to-Video Models
- Authors: Rohan Dhesikan, Vignesh Rajmohan
- Abstract summary: We propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models.
Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The proliferation of video content demands efficient and flexible neural
network based approaches for generating new video content. In this paper, we
propose a novel approach that combines zero-shot text-to-video generation with
ControlNet to improve the output of these models. Our method takes multiple
sketched frames as input and generates video output that matches the flow of
these frames, building upon the Text-to-Video Zero architecture and
incorporating ControlNet to enable additional input conditions. We first
interpolate frames between the input sketches and then run Text-to-Video Zero
with the resulting interpolated-frame video as the control signal, thereby
combining the benefits of zero-shot text-to-video generation with the robust
control provided by ControlNet. Experiments demonstrate that our
method excels at producing high-quality and remarkably consistent video content
that more accurately aligns with the user's intended motion for the subject
within the video. We provide a comprehensive resource package, including a demo
video, project website, open-source GitHub repository, and a Colab playground
to foster further research and application of our proposed method.
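For readers who want to experiment with the idea, the sketch below illustrates the two-stage pipeline the abstract describes, assuming the Hugging Face diffusers library and publicly available ControlNet/Stable Diffusion checkpoints. It is not the authors' released code (that lives in their GitHub repository): it replaces a proper frame-interpolation model with a naive linear blend between sketch keyframes and conditions a scribble ControlNet on each interpolated frame independently, omitting the cross-frame attention of Text-to-Video Zero that the paper uses for temporal consistency. File names, model identifiers, and the prompt are placeholders.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline


def interpolate_sketches(keyframes, n_between=4):
    """Linearly blend consecutive sketch keyframes (PIL images) into a dense
    control sequence. A learned frame-interpolation model could be swapped in here."""
    arrays = [np.asarray(k.convert("RGB"), dtype=np.float32) for k in keyframes]
    frames = []
    for a, b in zip(arrays[:-1], arrays[1:]):
        for t in np.linspace(0.0, 1.0, num=n_between, endpoint=False):
            frames.append(Image.fromarray(((1.0 - t) * a + t * b).astype(np.uint8)))
    frames.append(keyframes[-1].convert("RGB"))
    return frames


# Scribble ControlNet on top of Stable Diffusion 1.5 (assumed, publicly available checkpoints).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Hypothetical user-drawn sketch keyframes.
keyframes = [Image.open(p) for p in ("sketch_0.png", "sketch_1.png", "sketch_2.png")]
control_frames = interpolate_sketches(keyframes, n_between=4)

prompt = "a man running across a field, photorealistic"
video_frames = []
for frame in control_frames:
    # Re-seed per frame so every frame starts from the same initial noise; this gives only
    # rough consistency, whereas the paper relies on Text-to-Video Zero's cross-frame attention.
    generator = torch.Generator(device="cuda").manual_seed(0)
    result = pipe(prompt, image=frame, num_inference_steps=20, generator=generator)
    video_frames.append(result.images[0])
```

In the method proposed by the paper, the interpolated sketch video would instead be passed as the control input to a Text-to-Video Zero pipeline with ControlNet conditioning, rather than generating each frame independently as above.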
Related papers
- LOVECon: Text-driven Training-Free Long Video Editing with ControlNet [9.762680144118061]
This paper aims to bridge the gap, establishing a simple and effective baseline for training-free diffusion model-based long video editing.
We build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts.
Our method manages to edit videos comprising hundreds of frames according to user requirements.
arXiv Detail & Related papers (2023-10-15T02:39:25Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- ControlVideo: Conditional Control for One-shot Text-driven Video Editing and Beyond [45.188722895165505]
ControlVideo generates a video that aligns with a given text while preserving the structure of the source video.
Building on a pre-trained text-to-image diffusion model, ControlVideo improves both fidelity and temporal consistency.
arXiv Detail & Related papers (2023-05-26T17:13:55Z)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- Non-Adversarial Video Synthesis with Learned Priors [53.26777815740381]
We focus on the problem of generating videos from latent noise vectors, without any reference input frames.
We develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network, and a generator through non-adversarial learning.
Our approach generates superior quality videos compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2020-03-21T02:57:33Z)