VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
- URL: http://arxiv.org/abs/2412.02259v1
- Date: Tue, 03 Dec 2024 08:33:50 GMT
- Title: VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
- Authors: Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim
- Abstract summary: Current video generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos. We propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation. Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
- Score: 70.61101071902596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video generation models excel at generating short clips but still struggle to create multi-shot, movie-like videos. Existing models, even when trained on large-scale data with abundant computational resources, are inadequate for maintaining a logical storyline and visual consistency across the multiple shots of a cohesive script, since they are typically trained with a single-shot objective. To this end, we propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation. VGoT is designed with three goals in mind. Multi-Shot Video Generation: We divide the video generation process into a structured, modular sequence comprising (1) Script Generation, which translates a brief story into detailed prompts for each shot; (2) Keyframe Generation, responsible for creating visually consistent keyframes faithful to character portrayals; (3) Shot-Level Video Generation, which transforms information from scripts and keyframes into shots; and (4) a Smoothing Mechanism that ensures a consistent multi-shot output. Reasonable Narrative Design: Inspired by cinematic scriptwriting, our prompt generation approach spans five key domains, ensuring logical consistency, character development, and narrative flow across the entire video. Cross-Shot Consistency: We ensure temporal and identity consistency by leveraging identity-preserving (IP) embeddings across shots, which are automatically created from the narrative. Additionally, we incorporate a cross-shot smoothing mechanism with a reset boundary that combines latent features from adjacent shots, yielding smooth transitions and visual coherence throughout the video. Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
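The abstract describes a four-stage, training-free pipeline (script generation, keyframe generation, shot-level video generation, cross-shot smoothing) that uses identity-preserving embeddings and a reset boundary blending latent features of adjacent shots. The Python sketch below is only a rough illustration of how such a pipeline could be wired together under those stated stages; every function name, tensor shape, and the linear cross-fade used for smoothing are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a VGoT-style multi-shot pipeline, based only on the abstract.
# All function bodies are placeholders (random tensors) standing in for real model
# calls; names, shapes, and the blending rule are assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class ShotPrompt:
    description: str          # detailed per-shot prompt from the script stage
    character_ids: List[str]  # characters in the shot (keys into IP embeddings)

def generate_script(story: str, num_shots: int) -> List[ShotPrompt]:
    """(1) Script Generation: expand a brief story into per-shot prompts (an LLM call in practice)."""
    return [ShotPrompt(f"{story} -- shot {i}", ["protagonist"]) for i in range(num_shots)]

def generate_keyframe(prompt: ShotPrompt, ip_embeddings: Dict[str, np.ndarray]) -> np.ndarray:
    """(2) Keyframe Generation: one keyframe per shot, conditioned on identity-preserving embeddings."""
    return np.random.rand(3, 64, 64)  # placeholder for a text/IP-conditioned image model

def generate_shot_latents(prompt: ShotPrompt, keyframe: np.ndarray, frames: int = 16) -> np.ndarray:
    """(3) Shot-Level Video Generation: latents of shape (T, C, H, W) from prompt + keyframe."""
    return np.random.rand(frames, *keyframe.shape)  # placeholder for a video diffusion model

def smooth_boundary(prev: np.ndarray, nxt: np.ndarray, k: int = 4) -> None:
    """(4) Smoothing: cross-fade the last k frames of one shot with the first k of the next,
    a simple stand-in for the paper's reset-boundary latent blending."""
    tail, head = prev[-k:].copy(), nxt[:k].copy()
    for i in range(k):
        w = (i + 1) / (k + 1)                     # ramps from ~0 toward 1 across the boundary
        prev[-k + i] = (1 - w) * tail[i] + w * head[i]
        nxt[i] = (1 - w) * tail[i] + w * head[i]

def vgot_like_pipeline(story: str, num_shots: int,
                       ip_embeddings: Dict[str, np.ndarray]) -> List[np.ndarray]:
    prompts = generate_script(story, num_shots)
    shots = [generate_shot_latents(p, generate_keyframe(p, ip_embeddings)) for p in prompts]
    for a, b in zip(shots, shots[1:]):            # smooth every adjacent pair of shots
        smooth_boundary(a, b)
    return shots

if __name__ == "__main__":
    clips = vgot_like_pipeline("A detective solves a case", num_shots=3,
                               ip_embeddings={"protagonist": np.random.rand(512)})
    print([c.shape for c in clips])               # three (16, 3, 64, 64) latent clips
```

In the actual system, each placeholder would be a pretrained model (an LLM for scripting, a text-to-image model for keyframes, a video diffusion model for shots), and the smoothing would operate on diffusion latents at the reset boundary rather than on raw frame tensors as shown here.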
Related papers
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines.
We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence.
VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2025-03-19T11:59:14Z)
- Long Context Tuning for Video Generation [63.060794860098795]
Long Context Tuning (LCT) is a training paradigm that expands the context window of pre-trained single-shot video diffusion models.
Our method expands full attention mechanisms from individual shots to encompass all shots within a scene.
Experiments demonstrate coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension.
arXiv Detail & Related papers (2025-03-13T17:40:07Z)
- Text2Story: Advancing Video Storytelling with Text Guidance [20.51001299249891]
We introduce a novel storytelling approach to enable seamless video generation with natural action transitions and structured narratives.
Our approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.
arXiv Detail & Related papers (2025-03-08T19:04:36Z)
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
- VideoStudio: Generating Consistent-Content and Multi-Scene Videos [88.88118783892779]
VideoStudio is a framework for consistent-content and multi-scene video generation.
VideoStudio leverages Large Language Models (LLMs) to convert the input prompt into a comprehensive multi-scene script.
VideoStudio outperforms SOTA video generation models in terms of visual quality, content consistency, and user preference.
arXiv Detail & Related papers (2024-01-02T15:56:48Z)
- MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset, since it uses a pre-trained text-to-video generative model without fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- Task-agnostic Temporally Consistent Facial Video Editing [84.62351915301795]
We propose a task-agnostic, temporally consistent facial video editing framework.
Based on a 3D reconstruction model, our framework is designed to handle several editing tasks in a more unified and disentangled manner.
Compared with the state-of-the-art facial image editing methods, our framework generates video portraits that are more photo-realistic and temporally smooth.
arXiv Detail & Related papers (2020-07-03T02:49:20Z)