From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
- URL: http://arxiv.org/abs/2507.02790v1
- Date: Thu, 03 Jul 2025 16:54:32 GMT
- Title: From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
- Authors: Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu
- Abstract summary: We propose a human-inspired automatic video editing framework (HIVE). Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models. Our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks.
- Score: 17.769963004697047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.
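The abstract specifies HIVE's stages but not an implementation. As a rough, self-contained illustration of how scene-level segmentation and the three editing subtasks (highlight detection, opening/ending selection, pruning) could be wired together, here is a minimal Python sketch. Every function, data field, and score below is a hypothetical placeholder with dummy stubs so the script runs; it is not the authors' code.

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    start: float                    # seconds into the source video
    end: float
    characters: list[str] = field(default_factory=list)
    summary: str = ""               # narrative summary from the MLLM
    highlight_score: float = 0.0    # narrative value assigned by the MLLM
    relevant: bool = True           # False -> pruned as off-story

def segment_scenes(video_path: str) -> list[Scene]:
    """Scene-level segmentation. A real system would run a shot/scene
    boundary detector; fixed dummy scenes keep this sketch runnable."""
    return [Scene(0, 20), Scene(20, 60), Scene(60, 105), Scene(105, 130)]

def mllm_analyze_scene(scene: Scene) -> Scene:
    """Stand-in for the multimodal-LLM pass (character extraction,
    dialogue analysis, narrative summarization). Scores here are fake."""
    scene.highlight_score = (scene.end - scene.start) % 7
    scene.relevant = scene.highlight_score > 0
    return scene

def edit(video_path: str, budget_s: float = 90.0) -> list[Scene]:
    scenes = [mllm_analyze_scene(s) for s in segment_scenes(video_path)]

    # Pruning: drop scenes judged irrelevant to the main storyline.
    kept = [s for s in scenes if s.relevant]

    # Opening/ending selection: naively pin the first and last kept
    # scenes as the clip's opening and ending.
    opening, ending = kept[0], kept[-1]

    # Highlight detection: fill the remaining time budget with the
    # highest-scoring middle scenes, then restore story order.
    middle = sorted(kept[1:-1], key=lambda s: s.highlight_score, reverse=True)
    picked = [opening, ending]
    used = (opening.end - opening.start) + (ending.end - ending.start)
    for s in middle:
        if used + (s.end - s.start) <= budget_s:
            picked.append(s)
            used += s.end - s.start
    return sorted(picked, key=lambda s: s.start)

if __name__ == "__main__":
    for s in edit("episode.mp4"):
        print(f"{s.start:6.1f}s - {s.end:6.1f}s  score={s.highlight_score:.0f}")
```

A real system would replace `segment_scenes` with an actual boundary detector and `mllm_analyze_scene` with genuine multimodal-LLM calls; the point of the sketch is only the decomposition into pruning, opening/ending selection, and budget-constrained highlight selection.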
Related papers
- Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models [10.585096070697348]
We introduce VideoNarrator, a novel training-free pipeline designed to generate dense video captions. VideoNarrator addresses the challenges of dense captioning by leveraging a flexible pipeline where off-the-shelf MLLMs and visual-language models can function as caption generators. Our experimental results demonstrate that the synergistic interaction of these components significantly enhances the quality and accuracy of video narrations.
arXiv Detail & Related papers (2025-07-22T22:16:37Z)
- REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing [56.992916488077476]
In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our objective evaluations show that the proposed method can effectively insert short video clips while preserving narrative coherence.
arXiv Detail & Related papers (2025-05-24T21:36:49Z)
- Text-to-Edit: Controllable End-to-End Video Ad Creation via Multimodal LLMs [6.300563383392837]
The exponential growth of short-video content has sharply increased the need for efficient, automated video editing solutions. We propose an innovative end-to-end foundational framework that achieves precise control over the final edited video content.
arXiv Detail & Related papers (2025-01-10T11:35:43Z)
- Text-Video Multi-Grained Integration for Video Moment Montage [13.794791614348084]
A new task called Video Moment Montage (VMM) aims to accurately locate the corresponding video segments based on a pre-provided narration text. We present a novel Text-Video Multi-Grained Integration (TV-MGI) method that efficiently fuses text features from the script with both shot-level and frame-level video features.
arXiv Detail & Related papers (2024-12-12T13:40:59Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z)
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising [43.35391175319815]
This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos.
We introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models to the generation and editing of long videos.
Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models.
arXiv Detail & Related papers (2023-05-29T17:38:18Z)
- Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts [116.05656635044357]
We propose a generic video editing framework called Make-A-Protagonist.
Specifically, we leverage multiple experts to parse the source video and the target visual and textual clues, and propose a visual-textual-based video generation model.
Results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
arXiv Detail & Related papers (2023-05-15T17:59:03Z)
- Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles.
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
arXiv Detail & Related papers (2021-07-25T17:24:50Z)
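Several of the entries above, most directly Transcript-to-Video, share a retrieve-and-sequence pattern: embed transcript sentences and candidate shots in a shared visual-language space, retrieve the best-matching shot per sentence, and order the picks under a coherence constraint. The sketch below illustrates only that generic pattern; the random vectors stand in for learned representations, and the neighbor bonus is a crude proxy for the papers' actual temporal-coherence and shot-sequencing modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for learned visual-language embeddings: in the
# systems above these would come from trained encoders.
sentence_emb = rng.normal(size=(4, 64))   # one row per transcript sentence
shot_emb = rng.normal(size=(50, 64))      # one row per candidate shot

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every sentence and every shot.
sim = normalize(sentence_emb) @ normalize(shot_emb).T

# Greedy sequencing: each sentence takes its most similar unused shot,
# with a small bonus for shots adjacent to the previous pick.
chosen: list[int] = []
prev = None
for scores in sim:
    scores = scores.copy()
    scores[chosen] = -np.inf              # forbid shot reuse
    if prev is not None:
        for j in (prev - 1, prev + 1):
            if 0 <= j < scores.shape[0]:
                scores[j] += 0.1          # temporal-coherence bonus
    prev = int(np.argmax(scores))
    chosen.append(prev)

print("shot sequence:", chosen)
```

With a fixed seed the output is deterministic; swapping in real encoder embeddings and a learned sequencing model is where each paper's actual contribution lies.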