CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition
- URL: http://arxiv.org/abs/2504.19894v1
- Date: Mon, 28 Apr 2025 15:28:14 GMT
- Title: CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition
- Authors: Quynh Phung, Long Mai, Fabian David Caba Heilbron, Feng Liu, Jia-Bin Huang, Cusuh Ham
- Abstract summary: We present CineVerse, a novel framework for the task of cinematic scene composition. Similar to traditional multi-shot generation, our task emphasizes the need for consistency and continuity across frames. Our task also focuses on addressing challenges inherent to filmmaking, such as multiple characters, complex interactions, and visual cinematic effects.
- Score: 23.795982778641573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present CineVerse, a novel framework for the task of cinematic scene composition. Similar to traditional multi-shot generation, our task emphasizes the need for consistency and continuity across frames. However, our task also focuses on addressing challenges inherent to filmmaking, such as multiple characters, complex interactions, and visual cinematic effects. In order to learn to generate such content, we first create the CineVerse dataset. We use this dataset to train our proposed two-stage approach. First, we prompt a large language model (LLM) with task-specific instructions to take in a high-level scene description and generate a detailed plan for the overall setting and characters, as well as the individual shots. Then, we fine-tune a text-to-image generation model to synthesize high-quality visual keyframes. Experimental results demonstrate that CineVerse yields promising improvements in generating visually coherent and contextually rich movie scenes, paving the way for further exploration in cinematic video synthesis.
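To make the two-stage pipeline concrete, below is a minimal sketch of how a plan-then-render flow could be wired together. This is not the CineVerse implementation: the LLM prompt, model checkpoints, and helper names are illustrative assumptions, and the paper fine-tunes its own text-to-image model rather than using an off-the-shelf one.

```python
# Illustrative plan-then-render sketch (assumed names and models, not the CineVerse
# release): an LLM drafts per-shot prompts, then a text-to-image model renders one
# keyframe per shot.
import torch
from diffusers import StableDiffusionXLPipeline
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def plan_scene(scene_description: str, num_shots: int = 4) -> list[str]:
    """Stage 1: prompt an LLM for a setting/character plan plus per-shot prompts."""
    instructions = (
        "You are a film director. Given a high-level scene description, write a short "
        f"setting and character plan, then {num_shots} shot descriptions (framing, "
        "characters, action), one per line, each prefixed with 'SHOT:'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": scene_description},
        ],
    )
    text = response.choices[0].message.content
    return [line.removeprefix("SHOT:").strip()
            for line in text.splitlines() if line.startswith("SHOT:")]

def render_keyframes(shot_prompts: list[str]):
    """Stage 2: render one keyframe per shot with a (possibly fine-tuned) T2I model."""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    return [pipe(prompt).images[0] for prompt in shot_prompts]

if __name__ == "__main__":
    shots = plan_scene("Two detectives argue in a rain-soaked alley at night.")
    for i, frame in enumerate(render_keyframes(shots)):
        frame.save(f"keyframe_{i:02d}.png")
```

In the paper's actual pipeline, the second stage is a text-to-image model fine-tuned on the CineVerse dataset so that characters and setting remain consistent across the keyframes of a scene.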
Related papers
- Can video generation replace cinematographers? Research on the cinematic language of generated video [31.0131670022777]
We propose a threefold approach to improve cinematic control in text-to-video (T2V) models. First, we introduce a meticulously annotated cinematic language dataset with twenty subcategories, covering shot framing, shot angles, and camera movements. Second, we present CameraDiff, which employs LoRA for precise and stable cinematic control, ensuring flexible shot generation. Third, we propose CameraCLIP, designed to evaluate cinematic alignment and guide multi-shot composition.
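As a rough illustration of how LoRA is typically attached to a diffusion backbone for this kind of controllable fine-tuning (the ranks, target modules, and checkpoint below are generic assumptions, not CameraDiff's configuration):

```python
# Generic LoRA-on-UNet sketch (assumed config, not the CameraDiff release):
# only small low-rank adapters on the attention projections are trained,
# so camera-control conditioning can be learned cheaply.
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", torch_dtype=torch.float16
)
unet.requires_grad_(False)  # freeze the base model weights

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)  # diffusers' PeftAdapterMixin injects the adapters

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable:,}")
```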
arXiv Detail & Related papers (2024-12-16T09:02:24Z)
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation [60.07447565026327]
We propose DreamRunner, a novel story-to-video generation method. We structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos.
arXiv Detail & Related papers (2024-11-25T18:41:56Z)
- DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph [6.980991481207376]
We introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph).
The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay's content.
arXiv Detail & Related papers (2024-10-18T17:56:11Z)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- Movie101v2: Improved Movie Narration Benchmark [53.54176725112229]
Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences.
We introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration.
Based on our new benchmark, we baseline a range of large vision-language models, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation.
arXiv Detail & Related papers (2024-04-20T13:15:27Z)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework composed of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
- TeViS: Translating Text Synopses to Video Storyboards [30.388090248346504]
We propose a new task called Text synopsis to Video Storyboard (TeViS)
It aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis.
The proposed VQ-Trans model first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation.
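For intuition, a toy vector-quantization step (a generic sketch of the technique, not the VQ-Trans code) maps each embedding to its nearest codebook entry:

```python
# Toy nearest-codebook vector quantization (generic technique sketch, not VQ-Trans).
import torch

def quantize(embeddings: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Snap each embedding (N, D) to its nearest code in the codebook (K, D)."""
    distances = torch.cdist(embeddings, codebook)  # (N, K) pairwise L2 distances
    indices = distances.argmin(dim=1)              # index of nearest code per row
    quantized = codebook[indices]                  # discrete representation (N, D)
    # straight-through estimator: gradients bypass the non-differentiable argmin
    return embeddings + (quantized - embeddings).detach()

codebook = torch.randn(512, 256)         # K = 512 codes of dimension 256
image_emb = torch.randn(8, 256)          # e.g. encoded storyboard candidates
discrete = quantize(image_emb, codebook)
```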
arXiv Detail & Related papers (2022-12-31T06:32:36Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- A Local-to-Global Approach to Multi-modal Movie Scene Segmentation [95.34033481442353]
We build a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies.
We propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie.
Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods.
arXiv Detail & Related papers (2020-04-06T13:58:08Z)