Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media
- URL: http://arxiv.org/abs/2509.16811v2
- Date: Sun, 28 Sep 2025 07:22:30 GMT
- Title: Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media
- Authors: Zihan Ding, Xinyi Wang, Junlong Chen, Per Ola Kristensson, Junxiao Shen
- Abstract summary: We present a prompt-driven, modular editing system that helps creators restructure multi-hour content through free-form prompts rather than timelines. At its core is a semantic indexing pipeline that builds a global narrative via temporal segmentation, guided memory compression, and cross-granularity fusion. Our system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.
- Score: 35.60423976124236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Creators struggle to edit long-form, narrative-rich videos not because of UI complexity, but due to the cognitive demands of searching, storyboarding, and sequencing hours of footage. Existing transcript- or embedding-based methods fall short for creative workflows, as models struggle to track characters, infer motivations, and connect dispersed events. We present a prompt-driven, modular editing system that helps creators restructure multi-hour content through free-form prompts rather than timelines. At its core is a semantic indexing pipeline that builds a global narrative via temporal segmentation, guided memory compression, and cross-granularity fusion, producing interpretable traces of plot, dialogue, emotion, and context. Users receive cinematic edits while optionally refining transparent intermediate outputs. Evaluated on 400+ videos with expert ratings, QA, and preference studies, our system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.
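The three-stage indexing pipeline named in the abstract (temporal segmentation, guided memory compression, cross-granularity fusion) can be sketched as follows. This is a minimal illustrative skeleton; every function name and data shape here is an assumption, not the paper's published API.

```python
# Hypothetical sketch of the semantic indexing pipeline described in the
# abstract. Function names and representations are illustrative stand-ins.

def temporal_segmentation(transcript, window=3):
    """Split a transcript (list of utterances) into fixed-size segments."""
    return [transcript[i:i + window] for i in range(0, len(transcript), window)]

def compress_memory(segment):
    """Stand-in for guided memory compression: keep a one-line gist per segment."""
    return " / ".join(u.split(".")[0] for u in segment)

def fuse(gists):
    """Cross-granularity fusion: a global narrative trace plus per-segment gists."""
    return {"global": " | ".join(gists), "segments": gists}

transcript = ["Anna leaves home. She is upset.",
              "Ben finds the letter. He calls Anna.",
              "They meet at the station. The train departs."]
segments = temporal_segmentation(transcript, window=1)
index = fuse([compress_memory(s) for s in segments])
```

The point of the sketch is the data flow: segment-level gists stay inspectable (the "transparent intermediate outputs" the abstract mentions) while the fused global trace supports prompt-driven retrieval over the whole narrative.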
Related papers
- EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers [3.3508228801277853]
We introduce EditYourself, a DiT-based framework for audio-driven video-to-video editing. It enables transcript-based modification of talking videos, including the seamless addition, removal, and retiming of visually spoken content. This represents a step toward generative video models as practical tools for professional video post-production.
arXiv Detail & Related papers (2026-01-29T18:49:27Z) - CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation [9.91271343855315]
CoAgent is a framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. A Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. A pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow.
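The plan-synthesize-verify loop with entity-level memory described above can be sketched as a toy pipeline. The planner, synthesizer, and verifier below are deliberately naive stand-ins (sentence splitting and capitalized-word entity extraction), not CoAgent's actual models; all names are assumptions.

```python
# Toy plan-synthesize-verify loop in the spirit of CoAgent; every component
# here is an illustrative stand-in for a learned model.

def plan_shots(prompt):
    # Storyboard planner stand-in: one shot per sentence, naive entity extraction.
    return [{"id": i, "text": s.strip(),
             "entities": sorted({w.strip(",") for w in s.split() if w.istitle()})}
            for i, s in enumerate(prompt.split(".")) if s.strip()]

def synthesize(shot, memory):
    # Synthesis stand-in: tag each entity with its remembered appearance so the
    # same entity keeps the same look across shots.
    for e in shot["entities"]:
        memory.setdefault(e, f"{e}#look{len(memory)}")
    return {"id": shot["id"], "uses": [memory[e] for e in shot["entities"]]}

def verify(clips, memory):
    # Verifier stand-in: every entity reference must resolve to a remembered look.
    return all(u in memory.values() for c in clips for u in c["uses"])

memory = {}
shots = plan_shots("Anna enters the cafe. Anna meets Ben. Ben waves")
clips = [synthesize(s, memory) for s in shots]
```

The design choice being illustrated is the shared `memory` dict: because both shot 1 and shot 2 resolve "Anna" through it, her appearance token is identical across shots, which is the consistency property the verifier checks.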
arXiv Detail & Related papers (2025-12-27T09:38:34Z) - Cut2Next: Generating Next Shot via In-Context Tuning [93.14744132897428]
Multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods often prioritize basic visual consistency, neglecting crucial editing patterns. We introduce Next Shot Generation (NSG): generating a subsequent, high-quality shot that synthesizes professional editing patterns.
arXiv Detail & Related papers (2025-08-11T17:56:59Z) - From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding [17.769963004697047]
We propose a human-inspired automatic video editing framework (HIVE). Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models. Our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks.
arXiv Detail & Related papers (2025-07-03T16:54:32Z) - Text2Story: Advancing Video Storytelling with Text Guidance [20.51001299249891]
We introduce a novel AI-empowered storytelling framework to enable seamless video generation with natural action transitions and structured narratives. We first present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video. We then introduce a dynamics-informed prompt weighting mechanism that adaptively adjusts the influence of scene and action prompts at each diffusion timestep.
arXiv Detail & Related papers (2025-03-08T19:04:36Z) - VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
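The cross-frame token-merging idea above can be sketched with plain NumPy: tokens in later frames that are nearly identical to a reference-frame token are replaced by that token, so self-attention sees fewer distinct tokens. This is a minimal sketch in the spirit of VidToMe, not its actual algorithm; the matching rule, threshold, and shapes are assumptions.

```python
import numpy as np

# Illustrative cross-frame token merging: snap each frame's tokens to their
# most similar frame-0 token when cosine similarity exceeds a threshold.
# Threshold, shapes, and the frame-0 reference choice are all assumptions.

def merge_tokens(frames, threshold=0.95):
    """frames: (T, N, D) token array; returns a (T, N, D) array where
    near-duplicate tokens are replaced by their frame-0 match."""
    base = frames[0]                                    # (N, D) reference tokens
    base_n = base / np.linalg.norm(base, axis=1, keepdims=True)
    merged = []
    for f in frames:
        f_n = f / np.linalg.norm(f, axis=1, keepdims=True)
        sim = f_n @ base_n.T                            # (N, N) cosine similarities
        match = sim.argmax(axis=1)                      # best base token per token
        distinct = sim.max(axis=1) < threshold          # too different to merge
        out = base[match].copy()                        # reuse matched base tokens
        out[distinct] = f[distinct]                     # keep distinct tokens as-is
        merged.append(out)
    return np.stack(merged)
```

Merging toward a shared reference is what buys both benefits the summary claims: temporal coherence (the same token reappears across frames) and lower memory (attention operates over fewer unique tokens).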
arXiv Detail & Related papers (2023-12-17T09:05:56Z) - Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on zero-shot text-to-video generation with an emphasis on data and cost efficiency.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
arXiv Detail & Related papers (2023-09-25T19:42:16Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles.
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
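The retrieval-plus-sequencing idea of the Transcript-to-Video entry can be sketched as a greedy search: for each transcript sentence, pick the shot that best matches the text, with a bonus for staying in the same scene as the previous pick. The word-overlap scoring and coherence bonus below are toy stand-ins for the paper's learned modules; all names are assumptions.

```python
# Toy greedy clip sequencing in the spirit of Transcript-to-Video: text match
# plus a temporal-coherence bonus. Scoring functions are illustrative stand-ins.

def overlap(a, b):
    """Naive text-match score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def sequence(transcript, shots, coherence=0.5):
    """Pick one shot per transcript sentence, greedily, favoring shots that
    match the text and share a scene with the previous pick."""
    picked, prev_scene = [], None
    for sent in transcript:
        def score(shot):
            bonus = coherence if shot["scene"] == prev_scene else 0.0
            return overlap(sent, shot["caption"]) + bonus
        best = max(shots, key=score)
        picked.append(best["id"])
        prev_scene = best["scene"]
    return picked

shots = [{"id": 0, "scene": "cafe", "caption": "a woman drinks coffee"},
         {"id": 1, "scene": "street", "caption": "a man walks outside"},
         {"id": 2, "scene": "cafe", "caption": "a woman pays the bill"}]
order = sequence(["woman drinks coffee", "woman pays bill", "man walks outside"], shots)
```

Greedy left-to-right selection is what makes this kind of sequencing fast enough for the real-time use the summary mentions, at the cost of never revisiting an earlier choice.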
arXiv Detail & Related papers (2021-07-25T17:24:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.