MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
- URL: http://arxiv.org/abs/2602.23228v1
- Date: Thu, 26 Feb 2026 17:08:08 GMT
- Title: MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
- Authors: Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang,
- Abstract summary: MovieTeller is a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence.
- Score: 33.39285561943112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
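The abstract's three-stage pipeline (face recognition as an external tool, grounding injection into the VLM prompt, and progressive multi-stage abstraction) can be sketched in miniature. Everything below is an illustrative stand-in, not the paper's actual models or API: `detect_faces`, `vlm_describe`, and `llm_summarize` are toy stubs that stand in for the face-recognition tool, an off-the-shelf VLM, and the summarization calls.

```python
# Toy sketch of MovieTeller's tool-augmented pipeline as described in the
# abstract. All three helpers below are illustrative stubs, not real APIs.

def detect_faces(frame):
    # Stand-in for the external face-recognition "tool": returns
    # (character identity, bounding box) pairs for one frame.
    return frame.get("faces", [])

def vlm_describe(frame, facts):
    # Stand-in for an off-the-shelf VLM call; the real system would
    # inject `facts` into the prompt to anchor its scene description.
    return f"[scene grounded on: {facts}]"

def llm_summarize(chunk):
    # Stand-in for one abstraction step over a chunk of descriptions.
    return " / ".join(chunk)

def ground_and_describe(frame):
    """Stages 1-2: build Factual Groundings, inject them into the prompt."""
    facts = "; ".join(f"{name} at {box}" for name, box in detect_faces(frame))
    return vlm_describe(frame, facts)

def progressive_abstract(descriptions, chunk_size=2):
    """Stage 3: summarize in stages so no one call exceeds the context."""
    level = descriptions
    while len(level) > 1:
        level = [llm_summarize(level[i:i + chunk_size])
                 for i in range(0, len(level), chunk_size)]
    return level[0]

frames = [
    {"faces": [("Alice", (10, 20, 50, 50))]},
    {"faces": [("Bob", (5, 5, 40, 40)), ("Alice", (80, 10, 45, 45))]},
]
scene_descriptions = [ground_and_describe(f) for f in frames]
synopsis = progressive_abstract(scene_descriptions)
print(synopsis)
```

The key design point the sketch mirrors is that identities come from a deterministic tool rather than the VLM itself, so character names in the synopsis are anchored to verifiable detections instead of the model's guesses.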
Related papers
- Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation [15.004606775581356]
LAVES is a hierarchical multi-agent system for generating high-quality instructional videos from educational problems. In large-scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost.
arXiv Detail & Related papers (2026-02-12T10:14:36Z) - The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation [95.18045807704284]
We introduce an end-to-end agentic framework for dialogue-to-cinematic-video generation. ScripterAgent is trained to translate coarse dialogue into a fine-grained, executable cinematic script. Our framework significantly improves script faithfulness and temporal fidelity across all tested video models.
arXiv Detail & Related papers (2026-01-25T08:10:28Z) - Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories [5.022547031373416]
We introduce a method that approaches video generation in a filmmaker-like manner. Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually.
arXiv Detail & Related papers (2025-12-17T18:10:27Z) - AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation [58.844504598618094]
We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities. We incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation.
arXiv Detail & Related papers (2025-12-11T18:59:34Z) - ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation [48.59900036213667]
Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image. We introduce ID-Composer, a novel framework that tackles multi-subject video generation from a text prompt and reference images.
arXiv Detail & Related papers (2025-11-01T11:29:14Z) - DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph [6.980991481207376]
We introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph). The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay's content.
arXiv Detail & Related papers (2024-10-18T17:56:11Z) - MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z) - Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR)
arXiv Detail & Related papers (2024-06-16T12:58:31Z) - Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on zero-shot text-to-video generation with an emphasis on data and cost efficiency.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
arXiv Detail & Related papers (2023-09-25T19:42:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.