EditDuet: A Multi-Agent System for Video Non-Linear Editing
- URL: http://arxiv.org/abs/2509.10761v1
- Date: Sat, 13 Sep 2025 00:27:02 GMT
- Title: EditDuet: A Multi-Agent System for Video Non-Linear Editing
- Authors: Marcelo Sandoval-Castaneda, Bryan Russell, Josef Sivic, Gregory Shakhnarovich, Fabian Caba Heilbron
- Abstract summary: We propose to automate the core task of video editing, formulating it as a sequential decision-making process. The Editor agent takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. We evaluate our system's output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference.
- Score: 24.334561615501105
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automated tools for video editing and assembly have applications ranging from filmmaking and advertisement to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving the actual editing to the user. In contrast, we propose to automate the core task of video editing, formulating it as a sequential decision-making process. Ours is a multi-agent approach: we design an Editor agent and a Critic agent. The Editor takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. The Critic, in turn, gives natural language feedback to the Editor based on the produced sequence, or renders it if it is satisfactory. We introduce a learning-based approach for enabling effective communication across specialized agents to address the language-driven video editing task. Finally, we explore an LLM-as-a-judge metric for evaluating the quality of video editing systems and compare it with general human preference. We evaluate our system's output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference.
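The Editor/Critic interaction described in the abstract can be pictured as an iterative refinement loop: the Editor proposes a cut, the Critic either returns feedback or approves the sequence for rendering. The sketch below is a toy illustration of that loop only; the agent functions, the `Timeline` structure, and the duration-based critique are hypothetical stand-ins (in the paper both agents are LLM-driven and use real editing tools), checking a single time constraint for concreteness.

```python
# Hypothetical sketch of an Editor-Critic refinement loop, loosely modeled
# on the abstract's description. All names and logic here are illustrative.
from dataclasses import dataclass

@dataclass
class Timeline:
    clips: list       # ordered clip names placed on the timeline
    duration: float   # total length in seconds

def editor_step(footage, feedback, target_duration):
    """Toy Editor: drops clips from the end until the cut fits the target.

    A real Editor agent would interpret `feedback` and the user instruction;
    this toy version ignores feedback and applies a fixed trimming rule.
    """
    selected = list(footage)
    while sum(d for _, d in selected) > target_duration and len(selected) > 1:
        selected.pop()
    return Timeline(clips=[name for name, _ in selected],
                    duration=sum(d for _, d in selected))

def critic_step(timeline, target_duration):
    """Toy Critic: returns natural-language feedback, or None to render."""
    if timeline.duration > target_duration:
        overshoot = timeline.duration - target_duration
        return f"Sequence is {overshoot:.1f}s over the time limit."
    return None  # satisfactory: hand off for rendering

def edit(footage, target_duration, max_rounds=5):
    feedback = None
    timeline = None
    for _ in range(max_rounds):
        timeline = editor_step(footage, feedback, target_duration)
        feedback = critic_step(timeline, target_duration)
        if feedback is None:
            break  # Critic approved the sequence
    return timeline

footage = [("intro", 4.0), ("broll_1", 6.0), ("broll_2", 5.0), ("outro", 3.0)]
cut = edit(footage, target_duration=12.0)
```

The loop terminates either when the Critic is satisfied or after a fixed number of rounds, mirroring the sequential decision-making framing in the abstract.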
Related papers
- ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing [12.967240894970098]
Shot assembly is a crucial step in film production and video editing. Traditionally, this process has been manually executed by experienced editors. We propose an energy-based optimization method for video shot assembly.
arXiv Detail & Related papers (2025-11-04T11:48:22Z) - In-Context Learning with Unpaired Clips for Instruction-based Video Editing [51.943707933717185]
We introduce a low-cost pretraining strategy for instruction-based video editing. Our framework first pretrains on approximately 1M real video clips to learn basic editing concepts. Our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity.
arXiv Detail & Related papers (2025-10-16T13:02:11Z) - From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding [17.769963004697047]
We propose a human-inspired automatic video editing framework (HIVE). Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models. Our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks.
arXiv Detail & Related papers (2025-07-03T16:54:32Z) - UNIC: Unified In-Context Video Editing [76.76077875564526]
UNified In-Context Video Editing (UNIC) is a framework that unifies diverse video editing tasks within a single model in an in-context manner. We introduce task-aware RoPE to facilitate consistent temporal positional encoding, and a condition bias that enables the model to clearly differentiate different editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.
arXiv Detail & Related papers (2025-06-04T17:57:43Z) - VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation [67.31149310468801]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z) - A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model [10.736207095604414]
We propose a two-stage scheme for general editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM).
We also propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions.
arXiv Detail & Related papers (2024-11-07T18:20:28Z) - ExpressEdit: Video Editing with Natural Language and Sketching [28.814923641627825]
Multimodality, combining natural language (NL) and sketching (two modalities humans naturally use for expression), can be utilized to support video editors.
We present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame.
arXiv Detail & Related papers (2024-03-26T13:34:21Z) - Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions [49.14827857853878]
ReimaginedAct comprises video understanding, reasoning, and editing modules.
Our method can accept not only direct instructional text prompts but also 'what if' questions to predict possible action changes.
arXiv Detail & Related papers (2024-03-11T22:46:46Z) - LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing [23.010237004536485]
Large language models (LLMs) can be integrated into the video editing workflow to reduce barriers to beginners.
LAVE is a novel system that provides LLM-powered agent assistance and language-augmented editing features.
Our user study, which included eight participants ranging from novices to proficient editors, demonstrated LAVE's effectiveness.
arXiv Detail & Related papers (2024-02-15T19:53:11Z) - Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts [116.05656635044357]
We propose a generic video editing framework called Make-A-Protagonist.
Specifically, we leverage multiple experts to parse source video, target visual and textual clues, and propose a visual-textual-based video generation model.
Results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
arXiv Detail & Related papers (2023-05-15T17:59:03Z) - The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing [90.59584961661345]
This work introduces the Anatomy of Video Editing, a dataset and benchmark to foster research in AI-assisted video editing.
Our benchmark suite focuses on video editing tasks beyond visual effects, such as automatic footage organization and assisted video assembly.
To enable research on these fronts, we annotate more than 1.5M tags with concepts relevant to cinematography, drawn from 196,176 shots sampled from movie scenes.
arXiv Detail & Related papers (2022-07-20T10:53:48Z) - Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor [44.36920938661454]
This paper proposes a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities.
Our editor provides an easy-to-use interface to apply modern lip-syncing algorithms interactively.
Our evaluations show a clear improvement in the efficiency of using human editors and an improved video generation quality.
arXiv Detail & Related papers (2021-10-16T14:19:12Z)