Video Background Music Generation: Dataset, Method and Evaluation
- URL: http://arxiv.org/abs/2211.11248v2
- Date: Fri, 4 Aug 2023 15:57:36 GMT
- Title: Video Background Music Generation: Dataset, Method and Evaluation
- Authors: Le Zhuo, Zhaokai Wang, Baisen Wang, Yue Liao, Chenxi Bao, Stanley
Peng, Songhao Han, Aixi Zhang, Fei Fang, Si Liu
- Abstract summary: We introduce a complete recipe including dataset, benchmark model, and evaluation metric for video background music generation.
We present SymMV, a video and symbolic music dataset with various musical annotations.
We also propose a benchmark video background music generation framework named V-MusProd.
- Score: 31.15901120245794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music is essential when editing videos, but selecting music manually is
difficult and time-consuming. Thus, we seek to automatically generate
background music tracks given video input. This is a challenging task since it
requires music-video datasets, efficient architectures for video-to-music
generation, and reasonable metrics, none of which currently exist. To close
this gap, we introduce a complete recipe including dataset, benchmark model,
and evaluation metric for video background music generation. We present SymMV,
a video and symbolic music dataset with various musical annotations. To the
best of our knowledge, it is the first video-music dataset with rich musical
annotations. We also propose a benchmark video background music generation
framework named V-MusProd, which utilizes music priors of chords, melody, and
accompaniment along with video-music relations of semantic, color, and motion
features. To address the lack of objective metrics for video-music
correspondence, we design a retrieval-based metric VMCP built upon a powerful
video-music representation learning model. Experiments show that with our
dataset, V-MusProd outperforms the state-of-the-art method in both music
quality and correspondence with videos. We believe our dataset, benchmark
model, and evaluation metric will boost the development of video background
music generation. Our dataset and code are available at
https://github.com/zhuole1025/SymMV.
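The abstract describes VMCP as a retrieval-based metric computed over a learned video-music embedding space. The paper defines the actual model and protocol; the following is only a minimal sketch of how such a retrieval metric is typically computed, assuming paired video/music embeddings and a recall@k criterion (the embedding dimensions, cosine similarity, and the `recall_at_k` helper are illustrative assumptions, not the paper's implementation).

```python
# Hedged sketch of a retrieval-based video-music correspondence metric.
# Assumption: row i of video_emb is paired with row i of music_emb.
import numpy as np

def recall_at_k(video_emb: np.ndarray, music_emb: np.ndarray, k: int) -> float:
    """Fraction of videos whose ground-truth music track (same row index)
    ranks in the top-k candidates by cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    sims = v @ m.T                       # (n_videos, n_tracks) similarities
    ranks = (-sims).argsort(axis=1)      # column 0 holds each video's best match
    hits = (ranks[:, :k] == np.arange(len(v))[:, None]).any(axis=1)
    return float(hits.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(100, 32))
    # Identical embeddings: every ground-truth track ranks first
    print(recall_at_k(emb, emb, k=1))    # 1.0
```

A generated soundtrack scores well under such a metric when its embedding retrieves (or is retrieved by) the correct video ahead of other candidates.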
Related papers
- VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [68.72384258320743]
We propose VidMuse, a framework for generating music aligned with video inputs.
VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
arXiv Detail & Related papers (2024-06-06T17:58:11Z)
- Diff-BGM: A Diffusion Model for Video Background Music Generation [16.94631443719866]
We propose a high-quality music-video dataset with detailed annotation and shot detection to provide multi-modal information about the video and music.
We then present evaluation metrics to assess music quality, including music diversity and alignment between music and video.
We propose the Diff-BGM framework to automatically generate the background music for a given video, which uses different signals to control different aspects of the music during the generation process.
arXiv Detail & Related papers (2024-05-20T09:48:36Z)
- MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding [39.149899771556704]
This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval.
MVBind operates as a self-supervised approach, acquiring inherent knowledge of intermodal relationships directly from data.
We construct a dataset, SVM-10K(Short Video with Music-10K), which mainly consists of meticulously selected short videos.
arXiv Detail & Related papers (2024-05-15T12:11:28Z)
- Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can match a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standardized assessment of representations from all open-source pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- V2Meow: Meowing to the Visual Beat via Video-to-Music Generation [47.076283429992664]
V2Meow is a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types.
It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames.
arXiv Detail & Related papers (2023-05-11T06:26:41Z)
- Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
- InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer [2.157478102241537]
We propose a novel attention-based model VMT that automatically generates piano scores from video frames.
Using model-generated music also prevents potential copyright infringement.
We release a new dataset composed of over 7 hours of piano scores with fine alignment between pop music videos and MIDI files.
arXiv Detail & Related papers (2021-12-31T06:39:28Z)
- Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize a video of a person playing a certain instrument, guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.