When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach
- URL: http://arxiv.org/abs/2510.05661v1
- Date: Tue, 07 Oct 2025 08:18:27 GMT
- Title: When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach
- Authors: Daniel Gonzálbez-Biosca, Josep Cabacas-Maso, Carles Ventura, Ismael Benito-Altamirano,
- Abstract summary: We address the challenge of editing multicamera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut.<n>Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut)<n>For the spatial selection task (how to cut), we improve the literature by updating from old backbones, e.g. ResNet, with a CLIP-based encoder and constraining distractor selection to segments from the same concert.
- Score: 9.554646174100123
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automated video editing remains an underexplored task in the computer vision and multimedia domains, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multicamera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms from the audio signals, plus an optional image embedding, and scalar temporal features through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve the literature by updating from old backbones, e.g. ResNet, with a CLIP-based encoder and constraining distractor selection to segments from the same concert. Our dataset was constructed following a pseudo-labeling approach, in which raw video data was automatically clustered into coherent shot segments. We show that our models outperformed previous baselines in detecting cut points and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.
Related papers
- AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation [58.844504598618094]
We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation.<n>Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities.<n>We incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation.
arXiv Detail & Related papers (2025-12-11T18:59:34Z) - Video Object Segmentation-Aware Audio Generation [13.505371291069892]
Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley.<n>We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues.<n>Our method demonstrates substantial improvements over current state-of-the-art methods and sets a new standard for controllable, high-fidelity Foley synthesis.
arXiv Detail & Related papers (2025-09-30T17:49:41Z) - Enhancing Scene Transition Awareness in Video Generation via Post-Training [0.4199844472131921]
We propose the textbfTransition-Aware Video dataset, which consists of preprocessed video clips with multiple scene transitions.<n>Our experiment shows that post-training on the textbfTAV dataset improves prompt-based scene transition understanding, narrows the gap between required and generated scenes, and maintains image quality.
arXiv Detail & Related papers (2025-07-24T02:50:26Z) - VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z) - UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state of the art results on all three different localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z) - AutoTransition: Learning to Recommend Video Transition Effects [20.384463765702417]
We present the premier work on performing automatic video transitions recommendation (VTR)
VTR is given a sequence of raw video shots and companion audio, recommend video transitions for each pair of neighboring shots.
We propose a novel multi-modal matching framework which consists of two parts.
arXiv Detail & Related papers (2022-07-27T12:00:42Z) - MovieCuts: A New Dataset and Benchmark for Cut Type Recognition [114.57935905189416]
This paper introduces the cut type recognition task, which requires modeling of multi-modal information.
We construct a large-scale dataset called MovieCuts, which contains more than 170K videoclips labeled among ten cut types.
Our best model achieves 45.7% mAP, which suggests that the task is challenging and that attaining highly accurate cut type recognition is an open research problem.
arXiv Detail & Related papers (2021-09-12T17:36:55Z) - Beyond Short Clips: End-to-End Video-Level Learning with Collaborative
Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z) - Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Per-formance Video Generation (APVG)
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z) - STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos [17.232631075144592]
Methods for instance segmentation in videos typically follow the tracking-by-detection paradigm.
We propose a novel approach that segments and tracks instances across space and time in a single stage.
Our method achieves state-of-the-art results across multiple datasets and tasks.
arXiv Detail & Related papers (2020-03-18T18:40:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.