Captain Cinema: Towards Short Movie Generation
- URL: http://arxiv.org/abs/2507.18634v1
- Date: Thu, 24 Jul 2025 17:59:56 GMT
- Title: Captain Cinema: Towards Short Movie Generation
- Authors: Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, Lu Jiang
- Abstract summary: We present Captain Cinema, a generation framework for short movie generation. Our approach generates a sequence of keyframes that outline the entire narrative. Our model is trained on a specially curated dataset consisting of interleaved data pairs.
- Score: 66.22442526026215
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present Captain Cinema, a generation framework for short movie generation. Given a detailed textual description of a movie storyline, our approach first generates a sequence of keyframes that outline the entire narrative, ensuring long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model that supports long-context learning, which produces the spatio-temporal dynamics between them. We refer to this step as bottom-up video synthesis. To support stable and efficient generation of multi-scene, long-narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a specially curated cinematic dataset consisting of interleaved data pairs. Our experiments demonstrate that Captain Cinema produces visually coherent and narratively consistent short movies with high quality and efficiency. Project page: https://thecinema.ai
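For illustration, below is a minimal sketch of how the two stages described in the abstract might compose. The paper does not publish an API; every name here (KeyframePlanner, VideoSynthesizer, plan, interpolate, num_scenes) is a hypothetical stand-in for the components the abstract describes, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of the Captain Cinema pipeline as described in the
# abstract. None of these names come from the paper's codebase.

@dataclass
class Keyframe:
    image: bytes   # planned keyframe image (placeholder representation)
    caption: str   # the scene/character description it realizes

@dataclass
class VideoClip:
    frames: List[bytes]

class KeyframePlanner:
    """Top-down stage: turn a full storyline into keyframes that pin down
    long-range scene layout and character appearance."""
    def plan(self, storyline: str, num_scenes: int) -> List[Keyframe]:
        raise NotImplementedError  # stands in for the keyframe generator

class VideoSynthesizer:
    """Bottom-up stage: a long-context video model (MM-DiT in the paper)
    that synthesizes the spatio-temporal dynamics between keyframes."""
    def interpolate(self, start: Keyframe, end: Keyframe) -> VideoClip:
        raise NotImplementedError  # stands in for the video model

def generate_short_movie(storyline: str, num_scenes: int = 8) -> List[VideoClip]:
    planner, synthesizer = KeyframePlanner(), VideoSynthesizer()
    # Top-down: plan every keyframe first, so coherence is fixed globally.
    keyframes = planner.plan(storyline, num_scenes)
    # Bottom-up: condition the video model on each consecutive keyframe pair.
    return [synthesizer.interpolate(a, b)
            for a, b in zip(keyframes, keyframes[1:])]
```

The ordering is the point the abstract emphasizes: all keyframes are planned before any video is synthesized, so narrative and visual consistency are established globally rather than drifting shot by shot.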
Related papers
- CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition [23.795982778641573]
We present CineVerse, a novel framework for the task of cinematic scene composition. Similar to traditional multi-shot generation, our task emphasizes the need for consistency and continuity across frames. Our task also focuses on addressing challenges inherent to filmmaking, such as multiple characters, complex interactions, and visual cinematic effects.
arXiv Detail & Related papers (2025-04-28T15:28:14Z)
- Automated Movie Generation via Multi-Agent CoT Planning [20.920129008402718]
MovieAgent is an automated movie generation framework based on multi-agent Chain of Thought (CoT) planning. It generates multi-scene, multi-shot long-form videos with a coherent narrative, while ensuring character consistency, synchronized subtitles, and stable audio. By employing multiple LLM agents to simulate the roles of a director, screenwriter, storyboard artist, and location manager, MovieAgent streamlines the production pipeline.
arXiv Detail & Related papers (2025-03-10T13:33:27Z)
- MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation [62.85764872989189]
There is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models. We present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation. The dataset will be public and continuously maintained, aiming to advance the field of long video generation.
arXiv Detail & Related papers (2024-11-22T10:25:08Z)
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
- DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph [6.980991481207376]
We introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph). The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay's content.
arXiv Detail & Related papers (2024-10-18T17:56:11Z)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- Movie101v2: Improved Movie Narration Benchmark [53.54176725112229]
Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences.
We introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration.
Based on our new benchmark, we baseline a range of large vision-language models, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation.
arXiv Detail & Related papers (2024-04-20T13:15:27Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- Movie Summarization via Sparse Graph Construction [65.16768855902268]
We propose a model that identifies turning-point (TP) scenes by building a sparse movie graph that represents relations between scenes and is constructed using multimodal information.
According to human judges, the summaries created by our approach are more informative and complete, and receive higher ratings, than the outputs of sequence-based models and general-purpose summarization algorithms.
arXiv Detail & Related papers (2020-12-14T13:54:34Z)