Related papers: Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

URL: http://arxiv.org/abs/2307.06940v1
Date: Thu, 13 Jul 2023 17:57:13 GMT
Title: Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
Authors: Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen
Abstract summary: We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
Score: 69.20173154096
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this by developing a framework comprised of two functional modules: (i) Motion Structure Retrieval, which provides video candidates with desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters. The videos are synthesized by following the structural guidance and appearance instruction. To ensure visual consistency across clips, we propose an effective concept personalization approach, which allows the specification of the desired character identities through text prompts. Extensive experiments demonstrate that our approach exhibits significant advantages over various existing baselines.

Related papers

VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation [48.318567065609216]
VAST (Video As Storyboard from Text) is a framework to generate high-quality videos from textual descriptions. By decoupling text understanding from video generation, VAST enables precise control over subject dynamics and scene composition. Experiments on the VBench benchmark demonstrate that VAST outperforms existing methods in both visual quality and semantic expression.
arXiv Detail & Related papers (2024-12-21T15:59:07Z)
RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework. Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z)
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance [36.26032505627126]
Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. In this paper, we explore customized video generation by utilizing text as context description and motion structure. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model.
arXiv Detail & Related papers (2023-06-01T17:43:27Z)
TaleCrafter: Interactive Story Visualization with Multiple Characters [49.14122401339003]
This paper proposes a system for generic interactive story visualization. It is capable of handling multiple novel characters and supporting the editing of layout and local structure. The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-generation (T2L), controllable text-to-image generation (C-T2I) and image-to-video animation (I2V)
arXiv Detail & Related papers (2023-05-29T17:11:39Z)
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of video reasonings. We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought. We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions. Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots. Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles. For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
arXiv Detail & Related papers (2021-07-25T17:24:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.