VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
- URL: http://arxiv.org/abs/2401.01256v1
- Date: Tue, 2 Jan 2024 15:56:48 GMT
- Title: VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
- Authors: Fuchen Long and Zhaofan Qiu and Ting Yao and Tao Mei
- Abstract summary: We propose a novel framework, namely VideoDrafter, for content-consistent multi-scene video generation.
VideoDrafter leverages Large Language Models (LLM) to convert the input prompt into a comprehensive multi-scene script.
VideoDrafter outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference.
- Score: 97.09631253302987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent innovations and breakthroughs in diffusion models have
significantly expanded the possibilities of generating high-quality videos for
the given prompts. Most existing works tackle the single-scene scenario, in which
only one video event occurs in a single background. Extending to multi-scene
video generation, however, is not trivial: it requires managing the logic
between scenes while preserving a consistent visual appearance of key content
across the scenes. In this paper, we propose a novel
framework, namely VideoDrafter, for content-consistent multi-scene video
generation. Technically, VideoDrafter leverages a Large Language Model (LLM) to
convert the input prompt into a comprehensive multi-scene script that benefits
from the logical knowledge learnt by the LLM. The script for each scene includes a
prompt describing the event, the foreground/background entities, as well as
camera movement. VideoDrafter identifies the common entities throughout the
script and asks LLM to detail each entity. The resultant entity description is
then fed into a text-to-image model to generate a reference image for each
entity. Finally, VideoDrafter outputs a multi-scene video by generating each
scene video via a diffusion process that takes the reference images, the
descriptive prompt of the event, and the camera movement into account. The diffusion
model incorporates the reference images as conditioning and alignment signals to
strengthen the content consistency of multi-scene videos. Extensive experiments
demonstrate that VideoDrafter outperforms the SOTA video generation models in
terms of visual quality, content consistency, and user preference.
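Read as a pipeline, the abstract describes four stages: script writing with an LLM, entity description, reference-image generation, and reference-conditioned scene rendering. The following is a minimal sketch of that flow, assuming placeholder callables (`llm`, `t2i_model`, `video_diffusion`) and illustrative field names; it is not the authors' released code.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All callables and field names are assumptions for illustration,
# not part of any released VideoDrafter API.

def generate_multiscene_video(prompt, llm, t2i_model, video_diffusion):
    # 1. The LLM expands the user prompt into a multi-scene script; each
    #    scene holds an event prompt, its foreground/background entities,
    #    and a camera-movement description.
    script = llm.write_script(prompt)  # -> list of per-scene dicts

    # 2. Collect the entities shared across scenes and ask the LLM to
    #    describe each one in detail.
    entities = {name for scene in script for name in scene["entities"]}
    entity_descriptions = {name: llm.describe_entity(name) for name in entities}

    # 3. Generate one reference image per entity with a text-to-image model,
    #    so every scene renders the same-looking entity.
    reference_images = {name: t2i_model(desc)
                        for name, desc in entity_descriptions.items()}

    # 4. Render each scene with a diffusion model conditioned on the
    #    reference images, the event prompt, and the camera movement.
    clips = []
    for scene in script:
        refs = [reference_images[name] for name in scene["entities"]]
        clips.append(video_diffusion(prompt=scene["event"],
                                     reference_images=refs,
                                     camera_motion=scene["camera"]))

    # 5. The per-scene clips are concatenated downstream into the final video.
    return clips
```

The step the abstract emphasizes is the third one: sharing a single reference image per entity across all scenes is what ties the scenes' visual appearance together.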
Related papers
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [55.977597688114514]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset since our method uses a pre-trained text-to-video generative model without a fine-tuning process.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z) - VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z) - MovieFactory: Automatic Movie Creation from Text using Large Generative
Models for Language and Images [92.13079696503803]
We present MovieFactory, a framework to generate cinematic-picture (3072×1280), film-style (multi-scene), and multi-modality (sounding) movies.
Our approach empowers users to create captivating movies with smooth transitions using simple text inputs.
arXiv Detail & Related papers (2023-06-12T17:31:23Z) - Show Me What and Tell Me How: Video Synthesis via Multimodal
Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.