AutoMatch: A Large-scale Audio Beat Matching Benchmark for Boosting Deep
Learning Assistant Video Editing
- URL: http://arxiv.org/abs/2303.01884v1
- Date: Fri, 3 Mar 2023 12:30:09 GMT
- Title: AutoMatch: A Large-scale Audio Beat Matching Benchmark for Boosting Deep
Learning Assistant Video Editing
- Authors: Sen Pei, Jingya Yu, Qi Chen, Wozhou He
- Abstract summary: Short video resources cannot exist independently of the valuable editing work contributed by numerous video creators.
In this paper, we investigate audio beat matching (ABM), which aims to recommend appropriate transition timestamps based on the background music.
This technique eases the labor-intensive work of video editing, freeing creators to focus on the creativity of video content.
- Score: 7.672758847025309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The explosion of short videos has dramatically reshaped the way people
socialize, yielding a new trend for daily sharing and access to the latest
information. These rich video resources have, on the one hand, benefited from the
popularization of portable devices with cameras, but on the other, they cannot
exist independently of the valuable editing work contributed by numerous video
creators. In this paper, we investigate a novel and practical problem, namely
audio beat matching (ABM), which aims to recommend appropriate transition
timestamps based on the background music. This technique eases the
labor-intensive work of video editing, freeing creators to focus on the
creativity of video content. We formally define the ABM problem and its
evaluation protocol. Meanwhile, a large-scale audio dataset, AutoMatch, with
over 87k finely annotated background music tracks, is presented to facilitate
this newly opened research direction. To lay solid foundations for subsequent
studies, we also propose a novel model, termed BeatX, to tackle this
challenging task. In addition, we introduce the concept of label scope, which
eliminates data imbalance issues and assigns adaptive weights to the ground
truth during training in one stop. Although numerous short video platforms have
flourished for a long time, research on this scenario remains insufficient, and
to the best of our knowledge, AutoMatch is the first large-scale dataset to
tackle the audio beat matching problem. We hope the released dataset and our
competitive baseline encourage more attention to this line of research. The
dataset and code will be made publicly available.
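The abstract does not spell out how the label scope is computed; the following is a minimal, hypothetical sketch of the general idea, assuming frame-level beat targets are softened within a fixed window around each annotated timestamp. The function name label_scope_weights, the hop size, and the linear decay are illustrative assumptions, not the BeatX formulation.

```python
import numpy as np

def label_scope_weights(beat_times, clip_len, hop=0.05, scope=0.2):
    """Illustrative 'label scope': instead of marking only the exact beat frame
    as positive (one-hot), every frame within `scope` seconds of an annotated
    beat receives a weight that decays with its distance to the beat, softening
    the extreme positive/negative imbalance of frame-level beat labels.

    beat_times : annotated transition/beat timestamps in seconds (assumed input)
    clip_len   : clip duration in seconds
    hop        : frame hop size in seconds (assumed value)
    scope      : half-width of the label scope around each beat (assumed value)
    """
    frames = np.arange(0.0, clip_len, hop)      # frame-level time axis
    weights = np.zeros_like(frames)
    for t in beat_times:
        dist = np.abs(frames - t)
        inside = dist <= scope
        # linear decay from 1.0 at the beat to 0.0 at the scope boundary
        weights[inside] = np.maximum(weights[inside], 1.0 - dist[inside] / scope)
    return frames, weights

frames, w = label_scope_weights(beat_times=[1.2, 2.8, 4.4], clip_len=6.0)
print(w.sum(), (w > 0).mean())  # soft positive mass vs. fraction of frames covered
```

Compared with one-hot frame labels, such a soft target spreads the supervision over a small neighborhood of each beat and weights it by proximity, which is one plausible reading of "eliminating data imbalance and assigning adaptive weights in one stop."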
Related papers
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content-creation processes in the music and film industry.
We hypothesize that focusing on these aspects of audio generation can improve performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Towards Robust and Truly Large-Scale Audio-Sheet Music Retrieval [4.722882736419499]
Cross-modal deep learning is used to learn joint embedding spaces that link the two distinct modalities - audio and sheet music images.
While there has been steady improvement on this front over the past years, a number of open problems still prevent large-scale employment of this methodology.
We identify a set of main challenges on the road towards robust and large-scale cross-modal music retrieval in real scenarios.
arXiv Detail & Related papers (2023-09-21T15:11:16Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- AutoTransition: Learning to Recommend Video Transition Effects [20.384463765702417]
We present the first work on automatic video transition recommendation (VTR).
Given a sequence of raw video shots and companion audio, VTR recommends video transitions for each pair of neighboring shots.
We propose a novel multi-modal matching framework which consists of two parts.
arXiv Detail & Related papers (2022-07-27T12:00:42Z)
- Modality-Balanced Embedding for Video Retrieval [21.81705847039759]
We identify a modality bias phenomenon in which the video encoder relies almost entirely on text matching.
We propose MBVR (short for Modality Balanced Video Retrieval) with two key components.
We show empirically that our method is both effective and efficient in solving the modality bias problem.
arXiv Detail & Related papers (2022-04-18T06:29:46Z)
- MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
arXiv Detail & Related papers (2022-01-07T19:00:21Z)
- APES: Audiovisual Person Search in Untrimmed Video [87.4124877066541]
We present the Audiovisual Person Search dataset (APES).
APES contains over 1.9K identities labeled along 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
arXiv Detail & Related papers (2021-06-03T08:16:42Z)