HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
- URL: http://arxiv.org/abs/2510.20822v1
- Date: Thu, 23 Oct 2025 17:59:59 GMT
- Title: HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
- Authors: Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, Huamin Qu
- Abstract summary: HoloCine is a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future.
- Score: 97.61653035827919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
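The sparse inter-shot pattern described in the abstract (dense attention within a shot, sparse links between shots) can be illustrated with a minimal sketch. The specific sparse rule below, where every token additionally attends to the leading `link_tokens` tokens of each shot, is an illustrative assumption, not the paper's exact pattern; the shot lengths and helper name are likewise hypothetical.

```python
import numpy as np

def sparse_inter_shot_mask(shot_lengths, link_tokens=1):
    """Boolean attention mask: dense within each shot, sparse across shots.

    Tokens attend fully within their own shot; across shots they attend
    only to the first `link_tokens` tokens of every shot. This is one
    plausible instantiation of a dense-within / sparse-between pattern.
    """
    total = sum(shot_lengths)
    mask = np.zeros((total, total), dtype=bool)
    starts = np.cumsum([0] + list(shot_lengths[:-1]))
    for start, length in zip(starts, shot_lengths):
        # dense self-attention inside the shot
        mask[start:start + length, start:start + length] = True
        # sparse links: attend to the leading tokens of every shot
        for other_start in starts:
            mask[start:start + length,
                 other_start:other_start + link_tokens] = True
    return mask

# Three shots of 3, 2, and 4 tokens: 9x9 mask, mostly block-diagonal.
mask = sparse_inter_shot_mask([3, 2, 4])
```

Window Cross-Attention would then be the complementary restriction on the text side: each shot's video tokens cross-attend only to the prompt span describing that shot, which is how per-shot directorial instructions stay localized.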
Related papers
- The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation [95.18045807704284]
We introduce an end-to-end agentic framework for dialogue-to-cinematic-video generation. ScripterAgent is trained to translate coarse dialogue into a fine-grained, executable cinematic script. Our framework significantly improves script faithfulness and temporal fidelity across all tested video models.
arXiv Detail & Related papers (2026-01-25T08:10:28Z) - OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory [47.073128448877775]
We propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings.
arXiv Detail & Related papers (2025-12-08T18:32:24Z) - Captain Cinema: Towards Short Movie Generation [66.22442526026215]
We present Captain Cinema, a generation framework for short movie generation. Our approach generates a sequence of syntheses that outline the entire narrative. Our model is trained on a specially curated dataset consisting of interleaved data pairs.
arXiv Detail & Related papers (2025-07-24T17:59:56Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2025-03-19T11:59:14Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [76.3175166538482]
VideoGen-of-Thought (VGoT) is a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.