VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
- URL: http://arxiv.org/abs/2309.15091v2
- Date: Fri, 12 Jul 2024 18:03:29 GMT
- Title: VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
- Authors: Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
- Abstract summary: VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
- Score: 62.51232333352754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-video (T2V) generation methods have seen significant advancements. However, the majority of these works focus on producing short video clips of a single event (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules. This prompts an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which includes the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities. Next, guided by this video plan, our video generator, named Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities across multiple scenes, while being trained only with image-level annotations. Our experiments demonstrate that our proposed VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with consistency, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation. Detailed ablation studies, including dynamic adjustment of layout control strength with an LLM and video generation with user-provided images, confirm the effectiveness of each component of our framework and its future potential.
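The abstract names four components of the video plan (scene descriptions, entity layouts, backgrounds, and consistency groupings) without giving a schema, so the following Python sketch is a hypothetical rendering of that structure; all class and field names are assumptions, not the authors' code.

```python
from dataclasses import dataclass, field

# Hypothetical schema for a VideoDirectorGPT-style "video plan".
# The abstract only names the four components, not their structure.

@dataclass
class Entity:
    name: str                                 # e.g. "dog"
    bbox: tuple[float, float, float, float]   # normalized (x0, y0, x1, y1) layout box

@dataclass
class Scene:
    description: str                          # natural-language scene description
    background: str                           # background for this scene
    entities: list[Entity] = field(default_factory=list)

@dataclass
class VideoPlan:
    scenes: list[Scene]
    # Entities that must look identical across scenes, so the video
    # generator can share their representations.
    consistency_groups: list[set[str]] = field(default_factory=list)

plan = VideoPlan(
    scenes=[
        Scene("A dog picks up a frisbee in a park", "sunny park",
              [Entity("dog", (0.10, 0.40, 0.45, 0.90)),
               Entity("frisbee", (0.50, 0.60, 0.65, 0.75))]),
        Scene("The same dog runs to its owner", "sunny park",
              [Entity("dog", (0.30, 0.40, 0.70, 0.90)),
               Entity("owner", (0.70, 0.20, 0.95, 0.95))]),
    ],
    consistency_groups=[{"dog"}],
)
```

In the full pipeline, GPT-4 would emit a plan like this from a single user prompt, and Layout2Vid would condition on each scene's boxes while sharing entity representations within each consistency group.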
Related papers
- An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM [2.387054460181102]
We introduce a simple yet novel strategy that arranges sampled video frames into a single image grid, so that only a single Vision Language Model (VLM) is needed.
The essence of video comprehension lies in managing temporal information together with the spatial detail of each frame.
Our extensive experimental analysis across ten zero-shot video question answering benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out of ten benchmarks.
arXiv Detail & Related papers (2024-03-27T09:48:23Z)
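The core IG-VLM trick, tiling uniformly sampled frames into one composite image so a single image-only VLM can reason over time, is easy to sketch. The snippet below is an illustrative reconstruction, not the authors' code; the 2x3 grid and cell size are assumptions.

```python
import cv2
from PIL import Image

def frames_to_grid(video_path: str, rows: int = 2, cols: int = 3,
                   cell: tuple[int, int] = (336, 336)) -> Image.Image:
    """Uniformly sample rows*cols frames and tile them into one image,
    in the spirit of IG-VLM; grid shape and cell size are assumptions."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    n = rows * cols
    grid = Image.new("RGB", (cols * cell[0], rows * cell[1]))
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / n))  # jump to i-th sample
        ok, frame = cap.read()
        if not ok:
            break
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).resize(cell)
        grid.paste(img, ((i % cols) * cell[0], (i // cols) * cell[1]))
    cap.release()
    return grid
```

The composite image is then passed to any single-image VLM together with a prompt noting that the frames are ordered left to right, top to bottom.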
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging because it requires modeling the spatiotemporal dynamics of video.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework can both comprehend and generate image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- VideoStudio: Generating Consistent-Content and Multi-Scene Videos [88.88118783892779]
VideoStudio is a framework for consistent-content and multi-scene video generation.
VideoStudio leverages Large Language Models (LLMs) to convert the input prompt into a comprehensive multi-scene script.
VideoStudio outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference.
arXiv Detail & Related papers (2024-01-02T15:56:48Z)
- VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities.
Existing Video LLMs, however, can only provide a coarse description of the entire video, failing to pinpoint when specific events occur.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z)
- VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning [47.61090084143284]
VideoDreamer can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
The video generator is further customized for the given multiple subjects by the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy.
arXiv Detail & Related papers (2023-11-02T04:38:50Z)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract the depth sequences of the retrieved videos as the motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
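The first module's output, per-frame depth used as motion structure, can be approximated with an off-the-shelf monocular depth estimator. The sketch below uses MiDaS as a stand-in, since the summary does not name the exact model; treat that choice as an assumption.

```python
import torch

# Per-frame depth as "motion structure", with MiDaS as an assumed
# stand-in for whatever depth estimator the paper actually uses.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

@torch.no_grad()
def motion_structure(frames):
    """frames: list of HxWx3 uint8 RGB arrays from a retrieved video.
    Returns one relative-depth map per frame; this depth sequence is
    what would guide the structure-guided synthesis module."""
    return [midas(transform(f)).squeeze(0) for f in frames]
```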
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
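Text2Video-Zero has a pipeline in Hugging Face diffusers; a minimal usage sketch follows, with the Stable Diffusion checkpoint chosen here as an assumption (any SD 1.x weight should work).

```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

# Zero-shot video from a pretrained text-to-image model: no video
# training data involved, matching the paper's setting.
pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frames = pipe(prompt="A panda surfing on a wave").images  # list of float arrays
frames = [(f * 255).astype("uint8") for f in frames]
imageio.mimsave("panda_surfing.mp4", frames, fps=4)
```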
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where the local context of a video frame is captured by a Cross-modal Transformer and the global video context by a Temporal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain a deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
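HERO's two-level hierarchy, a Cross-modal Transformer fusing each frame with its local text followed by a Temporal Transformer across frames, can be outlined in PyTorch. This is a structural sketch inferred from the summary above, not the released implementation; dimensions and layer counts are arbitrary.

```python
import torch
import torch.nn as nn

class HeroStyleEncoder(nn.Module):
    """Structural sketch of HERO's hierarchy: local cross-modal fusion
    per frame, then temporal contextualization across the clip."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_modal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, D) one feature per frame
        # text_feats:  (B, T, L, D) subtitle tokens aligned to each frame
        B, T, L, D = text_feats.shape
        # Local fusion: each frame attends over its own text tokens.
        local = torch.cat([frame_feats.unsqueeze(2), text_feats], dim=2)
        local = self.cross_modal(local.view(B * T, L + 1, D))
        fused = local[:, 0].view(B, T, D)  # keep the fused frame slot
        # Global context: temporal transformer across the clip.
        return self.temporal(fused)

enc = HeroStyleEncoder()
print(enc(torch.randn(2, 8, 256), torch.randn(2, 8, 16, 256)).shape)
# torch.Size([2, 8, 256])
```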
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.