The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation
- URL: http://arxiv.org/abs/2601.17737v2
- Date: Tue, 27 Jan 2026 02:50:46 GMT
- Title: The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation
- Authors: Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus
- Abstract summary: We introduce an end-to-end agentic framework for dialogue-to-cinematic-video generation. ScripterAgent is trained to translate coarse dialogue into a fine-grained, executable cinematic script. Our framework significantly improves script faithfulness and temporal fidelity across all tested video models.
- Score: 95.18045807704284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.
Related papers
- Bridging Your Imagination with Audio-Video Generation via a Unified Director [54.45375287950375]
We argue that logical reasoning and imaginative thinking are both fundamental qualities of a film director. We propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts.
arXiv Detail & Related papers (2025-12-29T05:56:22Z)
- HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives [97.61653035827919]
HoloCine is a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future.
arXiv Detail & Related papers (2025-10-23T17:59:59Z)
- Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling [15.410503589735699]
We propose Dialogue Visualization, a novel task that transforms dialogue scripts into dynamic, multi-view storyboards. We introduce Dialogue Director, a training-free multimodal framework comprising a Script Director, Cinematographer, and Storyboard Maker. Experimental results demonstrate that Dialogue Director outperforms state-of-the-art methods in script interpretation, physical world understanding, and cinematic principle application.
arXiv Detail & Related papers (2024-12-30T05:54:23Z)
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [76.3175166538482]
VideoGen-of-Thought (VGoT) is a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)