VideoBooth: Diffusion-based Video Generation with Image Prompts
- URL: http://arxiv.org/abs/2312.00777v1
- Date: Fri, 1 Dec 2023 18:55:40 GMT
- Title: VideoBooth: Diffusion-based Video Generation with Image Prompts
- Authors: Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu
Qiao, Chen Change Loy, Ziwei Liu
- Abstract summary: We propose a feed-forward framework for video generation with image prompts.
VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts.
- Score: 130.47771531413375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-driven video generation has witnessed rapid progress. However, merely using
text prompts is not enough to depict the desired subject appearance that
accurately aligns with users' intents, especially for customized content
creation. In this paper, we study the task of video generation with image
prompts, which provide more accurate and direct content control beyond the text
prompts. Specifically, we propose VideoBooth, a feed-forward framework with two
dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine
manner. Coarse visual embeddings from the image encoder provide high-level
encodings of image prompts, while fine visual embeddings from the proposed
attention injection module provide multi-scale, detailed encodings of image
prompts. These two complementary embeddings can faithfully capture the desired
appearance. 2) In the attention injection module at fine level, multi-scale
image prompts are fed into different cross-frame attention layers as additional
keys and values. This extra spatial information refines the details in the
first frame, and the refinement is then propagated to the remaining frames,
maintaining temporal consistency. Extensive experiments demonstrate that VideoBooth
achieves state-of-the-art performance in generating customized high-quality
videos with subjects specified in image prompts. Notably, VideoBooth is a
generalizable framework in which a single model works for a wide range of image
prompts with a single feed-forward pass.
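To make the fine-level design concrete, here is a minimal sketch of the attention injection idea described in the abstract: image-prompt features are appended as extra keys and values in a cross-frame attention layer, so every frame attends to the prompt-augmented first frame. The class name, tensor shapes, single-head simplification, and single-scale setup below are illustrative assumptions, not the authors' released code; VideoBooth applies this at multiple scales inside a latent video diffusion U-Net, and the coarse visual embeddings form a separate, complementary pathway that is not modeled here.

```python
# Minimal sketch (assumed names/shapes, not the authors' code) of appending
# image-prompt features as extra keys/values in cross-frame attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossFrameAttnWithImagePrompt(nn.Module):
    """Single-head cross-frame attention with image-prompt K/V injection."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, frames: torch.Tensor, prompt_feats: torch.Tensor) -> torch.Tensor:
        # frames:       (batch, num_frames, tokens, dim)  latent tokens per frame
        # prompt_feats: (batch, prompt_tokens, dim)       image-prompt features at this scale
        b, f, n, d = frames.shape

        # Queries come from every frame.
        q = self.to_q(frames).reshape(b * f, n, d)

        # Keys/values come from the first frame, with the image-prompt tokens
        # appended as additional keys and values ("attention injection").
        kv_src = torch.cat([frames[:, 0], prompt_feats], dim=1)  # (b, n + p, d)
        k = self.to_k(kv_src)
        v = self.to_v(kv_src)

        # Share the same K/V across all frames so details refined in the
        # first frame propagate to the remaining frames.
        k = k.unsqueeze(1).expand(-1, f, -1, -1).reshape(b * f, -1, d)
        v = v.unsqueeze(1).expand(-1, f, -1, -1).reshape(b * f, -1, d)

        out = F.scaled_dot_product_attention(q, k, v)             # (b*f, n, d)
        return self.to_out(out).reshape(b, f, n, d)


# Toy usage: 2 videos, 8 frames, 64 latent tokens of width 320, 16 prompt tokens.
layer = CrossFrameAttnWithImagePrompt(dim=320)
frames = torch.randn(2, 8, 64, 320)
prompt_feats = torch.randn(2, 16, 320)
print(layer(frames, prompt_feats).shape)  # torch.Size([2, 8, 64, 320])
```

The design point this sketch tries to capture is that all frames share keys and values built from the first frame plus the image prompt, which is one way to read the abstract's claim that prompt details refined in the first frame are propagated to the remaining frames while keeping temporal consistency.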
Related papers
- MAMS: Model-Agnostic Module Selection Framework for Video Captioning [11.442879458679144]
Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges.
This paper proposes the first model-agnostic module selection framework in video captioning.
Our experiments on three different benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
arXiv Detail & Related papers (2025-01-30T11:10:18Z) - Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z) - RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content.
The proposed framework demonstrates impressive, versatile capabilities in video-to-paragraph generation and video content editing, and it can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z) - Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - Accurate and Fast Compressed Video Captioning [28.19362369787383]
Existing video captioning approaches typically require first sampling frames from a decoded video and then running subsequent processing.
We study video captioning from a different perspective, in the compressed domain, which brings multiple advantages over the existing pipeline.
We propose a simple yet effective end-to-end transformer that learns captioning directly from the compressed video.
arXiv Detail & Related papers (2023-09-22T13:43:22Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z) - Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with a ResNet-101 backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.