Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media
- URL: http://arxiv.org/abs/2509.16811v2
- Date: Sun, 28 Sep 2025 07:22:30 GMT
- Title: Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media
- Authors: Zihan Ding, Xinyi Wang, Junlong Chen, Per Ola Kristensson, Junxiao Shen
- Abstract summary: We present a prompt-driven, modular editing system that helps creators restructure multi-hour content through free-form prompts rather than timelines. At its core is a semantic indexing pipeline that builds a global narrative via temporal segmentation, guided memory compression, and cross-granularity fusion. Our system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.
- Score: 35.60423976124236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Creators struggle to edit long-form, narrative-rich videos not because of UI complexity, but due to the cognitive demands of searching, storyboarding, and sequencing hours of footage. Existing transcript- or embedding-based methods fall short for creative workflows, as models struggle to track characters, infer motivations, and connect dispersed events. We present a prompt-driven, modular editing system that helps creators restructure multi-hour content through free-form prompts rather than timelines. At its core is a semantic indexing pipeline that builds a global narrative via temporal segmentation, guided memory compression, and cross-granularity fusion, producing interpretable traces of plot, dialogue, emotion, and context. Users receive cinematic edits while optionally refining transparent intermediate outputs. Evaluated on 400+ videos with expert ratings, QA, and preference studies, our system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.
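The three-stage indexing pipeline named in the abstract (temporal segmentation, guided memory compression, cross-granularity fusion) can be sketched as follows. This is a minimal illustrative skeleton; every function name and data shape here is an assumption, not the paper's published API.

```python
# Hypothetical sketch of the semantic indexing pipeline described in the
# abstract. Function names and representations are illustrative stand-ins.

def temporal_segmentation(transcript, window=3):
    """Split a transcript (list of utterances) into fixed-size segments."""
    return [transcript[i:i + window] for i in range(0, len(transcript), window)]

def compress_memory(segment):
    """Stand-in for guided memory compression: keep a one-line gist per segment."""
    return " / ".join(u.split(".")[0] for u in segment)

def fuse(gists):
    """Cross-granularity fusion: a global narrative trace plus per-segment gists."""
    return {"global": " | ".join(gists), "segments": gists}

transcript = ["Anna leaves home. She is upset.",
              "Ben finds the letter. He calls Anna.",
              "They meet at the station. The train departs."]
segments = temporal_segmentation(transcript, window=1)
index = fuse([compress_memory(s) for s in segments])
```

The point of the sketch is the data flow: segment-level gists stay inspectable (the "transparent intermediate outputs" the abstract mentions) while the fused global trace supports prompt-driven retrieval over the whole narrative.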
Related papers
- EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers [3.3508228801277853]
We introduce EditYourself, a DiT-based framework for audio-driven video-to-video editing. It enables transcript-based modification of talking videos, including the seamless addition, removal, and retiming of visually spoken content. This represents a step toward generative video models as practical tools for professional video post-production.
arXiv Detail & Related papers (2026-01-29T18:49:27Z) - CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation [9.91271343855315]
CoAgent is a framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. A Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. A pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow.
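The plan-synthesize-verify loop with entity-level memory described above can be sketched as a toy pipeline. The planner, synthesizer, and verifier below are deliberately naive stand-ins (sentence splitting and capitalized-word entity extraction), not CoAgent's actual models; all names are assumptions.

```python
# Toy plan-synthesize-verify loop in the spirit of CoAgent; every component
# here is an illustrative stand-in for a learned model.

def plan_shots(prompt):
    # Storyboard planner stand-in: one shot per sentence, naive entity extraction.
    return [{"id": i, "text": s.strip(),
             "entities": sorted({w.strip(",") for w in s.split() if w.istitle()})}
            for i, s in enumerate(prompt.split(".")) if s.strip()]

def synthesize(shot, memory):
    # Synthesis stand-in: tag each entity with its remembered appearance so the
    # same entity keeps the same look across shots.
    for e in shot["entities"]:
        memory.setdefault(e, f"{e}#look{len(memory)}")
    return {"id": shot["id"], "uses": [memory[e] for e in shot["entities"]]}

def verify(clips, memory):
    # Verifier stand-in: every entity reference must resolve to a remembered look.
    return all(u in memory.values() for c in clips for u in c["uses"])

memory = {}
shots = plan_shots("Anna enters the cafe. Anna meets Ben. Ben waves")
clips = [synthesize(s, memory) for s in shots]
```

The design choice being illustrated is the shared `memory` dict: because both shot 1 and shot 2 resolve "Anna" through it, her appearance token is identical across shots, which is the consistency property the verifier checks.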
arXiv Detail & Related papers (2025-12-27T09:38:34Z) - Cut2Next: Generating Next Shot via In-Context Tuning [93.14744132897428]
Multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods often prioritize basic visual consistency, neglecting crucial editing patterns. We introduce Next Shot Generation (NSG): generating a subsequent, high-quality shot that synthesizes professional editing patterns.
arXiv Detail & Related papers (2025-08-11T17:56:59Z) - From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding [17.769963004697047]
We propose a human-inspired automatic video editing framework (HIVE). Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models. Our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks.
arXiv Detail & Related papers (2025-07-03T16:54:32Z) - Text2Story: Advancing Video Storytelling with Text Guidance [20.51001299249891]
We introduce a novel AI-empowered storytelling framework to enable seamless video generation with natural action transitions and structured narratives. We first present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video. We then introduce a dynamics-informed prompt weighting mechanism that adaptively adjusts the influence of scene and action prompts at each diffusion timestep.
arXiv Detail & Related papers (2025-03-08T19:04:36Z) - VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
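The cross-frame token-merging idea above can be sketched with plain NumPy: tokens in later frames that are nearly identical to a reference-frame token are replaced by that token, so self-attention sees fewer distinct tokens. This is a minimal sketch in the spirit of VidToMe, not its actual algorithm; the matching rule, threshold, and shapes are assumptions.

```python
import numpy as np

# Illustrative cross-frame token merging: snap each frame's tokens to their
# most similar frame-0 token when cosine similarity exceeds a threshold.
# Threshold, shapes, and the frame-0 reference choice are all assumptions.

def merge_tokens(frames, threshold=0.95):
    """frames: (T, N, D) token array; returns a (T, N, D) array where
    near-duplicate tokens are replaced by their frame-0 match."""
    base = frames[0]                                    # (N, D) reference tokens
    base_n = base / np.linalg.norm(base, axis=1, keepdims=True)
    merged = []
    for f in frames:
        f_n = f / np.linalg.norm(f, axis=1, keepdims=True)
        sim = f_n @ base_n.T                            # (N, N) cosine similarities
        match = sim.argmax(axis=1)                      # best base token per token
        distinct = sim.max(axis=1) < threshold          # too different to merge
        out = base[match].copy()                        # reuse matched base tokens
        out[distinct] = f[distinct]                     # keep distinct tokens as-is
        merged.append(out)
    return np.stack(merged)
```

Merging toward a shared reference is what buys both benefits the summary claims: temporal coherence (the same token reappears across frames) and lower memory (attention operates over fewer unique tokens).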
arXiv Detail & Related papers (2023-12-17T09:05:56Z) - Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on zero-shot text-to-video generation with an emphasis on data and cost efficiency.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
arXiv Detail & Related papers (2023-09-25T19:42:16Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles.
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
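The retrieval-plus-sequencing idea of the Transcript-to-Video entry can be sketched as a greedy search: for each transcript sentence, pick the shot that best matches the text, with a bonus for staying in the same scene as the previous pick. The word-overlap scoring and coherence bonus below are toy stand-ins for the paper's learned modules; all names are assumptions.

```python
# Toy greedy clip sequencing in the spirit of Transcript-to-Video: text match
# plus a temporal-coherence bonus. Scoring functions are illustrative stand-ins.

def overlap(a, b):
    """Naive text-match score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def sequence(transcript, shots, coherence=0.5):
    """Pick one shot per transcript sentence, greedily, favoring shots that
    match the text and share a scene with the previous pick."""
    picked, prev_scene = [], None
    for sent in transcript:
        def score(shot):
            bonus = coherence if shot["scene"] == prev_scene else 0.0
            return overlap(sent, shot["caption"]) + bonus
        best = max(shots, key=score)
        picked.append(best["id"])
        prev_scene = best["scene"]
    return picked

shots = [{"id": 0, "scene": "cafe", "caption": "a woman drinks coffee"},
         {"id": 1, "scene": "street", "caption": "a man walks outside"},
         {"id": 2, "scene": "cafe", "caption": "a woman pays the bill"}]
order = sequence(["woman drinks coffee", "woman pays bill", "man walks outside"], shots)
```

Greedy left-to-right selection is what makes this kind of sequencing fast enough for the real-time use the summary mentions, at the cost of never revisiting an earlier choice.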
arXiv Detail & Related papers (2021-07-25T17:24:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.