SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
- URL: http://arxiv.org/abs/2311.16933v1
- Date: Tue, 28 Nov 2023 16:33:08 GMT
- Title: SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
- Authors: Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, Bo Dai
- Abstract summary: We present SparseCtrl to enable flexible structure control with temporally sparse signals.
It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched.
The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images.
- Score: 84.71887272654865
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The development of text-to-video (T2V), i.e., generating videos with a given
text prompt, has been significantly advanced in recent years. However, relying
solely on text prompts often results in ambiguous frame composition due to
spatial uncertainty. The research community thus leverages dense structure
signals, e.g., per-frame depth/edge sequences, to enhance controllability, but
collecting such per-frame signals adds a burden at inference time. In this work,
we present SparseCtrl to enable flexible structure control with temporally
sparse signals, requiring only one or a few inputs, as shown in Figure 1. It
incorporates an additional condition encoder to process these sparse signals
while leaving the pre-trained T2V model untouched. The proposed approach is
compatible with various modalities, including sketches, depth maps, and RGB
images, providing more practical control for video generation and promoting
applications such as storyboarding, depth rendering, keyframe animation, and
interpolation. Extensive experiments demonstrate the generalization of
SparseCtrl on both original and personalized T2V generators. Codes and models
will be publicly available at https://guoyww.github.io/projects/SparseCtrl .
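The core mechanism described in the abstract, a small condition encoder that ingests a handful of conditioned keyframes plus a mask marking which frames actually carry a condition, while the pre-trained T2V backbone stays frozen, can be illustrated with a short sketch. The following is a minimal PyTorch sketch under assumed names and shapes: `SparseConditionEncoder`, the channel sizes, and the idea of adding its output as a residual to the frozen backbone's latents are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of temporally sparse conditioning in a SparseCtrl-like setup.
# An auxiliary encoder processes a few conditioned keyframes plus a binary mask
# marking which frames carry a condition; its output is meant to be added to the
# frozen T2V backbone's latents. Names and shapes here are illustrative.
import torch
import torch.nn as nn


class SparseConditionEncoder(nn.Module):
    """Encodes per-frame condition images (e.g. sketch/depth/RGB) together with
    a frame-level mask indicating which frames actually hold a condition."""

    def __init__(self, cond_channels: int = 3, latent_channels: int = 4):
        super().__init__()
        # +1 input channel for the binary "is this frame conditioned?" mask
        self.net = nn.Sequential(
            nn.Conv3d(cond_channels + 1, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, padding=1),
        )

    def forward(self, cond: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # cond: (B, C, T, H, W), zeros at unconditioned frames
        # mask: (B, 1, T, 1, 1), binary, broadcast over the spatial dims
        mask = mask.expand(-1, 1, -1, cond.shape[-2], cond.shape[-1])
        return self.net(torch.cat([cond, mask], dim=1))


if __name__ == "__main__":
    B, C, T, H, W = 1, 3, 16, 32, 32
    cond = torch.zeros(B, C, T, H, W)
    cond[:, :, 0] = torch.rand(B, C, H, W)   # only the first frame is given
    mask = torch.zeros(B, 1, T, 1, 1)
    mask[:, :, 0] = 1.0
    encoder = SparseConditionEncoder()
    residual = encoder(cond, mask)           # would be added to frozen T2V latents
    print(residual.shape)                    # torch.Size([1, 4, 16, 32, 32])
```

Because only the encoder carries trainable parameters, this wiring leaves the pre-trained T2V weights untouched, which is what allows the same controls to transfer to personalized generators.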
Related papers
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z) - Retrieval Enhanced Zero-Shot Video Captioning [69.96136689829778]
We bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2.
To connect these frozen models, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP (see the sketch after this list).
Experiments show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-05-11T16:22:00Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - Latent-Shift: Latent Diffusion with Temporal Shift for Efficient
Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
arXiv Detail & Related papers (2023-04-17T17:57:06Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released at https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)