GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions
- URL: http://arxiv.org/abs/2501.09972v1
- Date: Fri, 17 Jan 2025 06:30:11 GMT
- Title: GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions
- Authors: Heda Zuo, Weitao You, Junxian Wu, Shihong Ren, Pei Chen, Mingxu Zhou, Yujia Lu, Lingyun Sun
- Abstract summary: We present the General Video-to-Music Generation model (GVMGen) for generating music closely related to the video input.
Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions.
Our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios.
- Abstract: Composing music for video is essential yet challenging, leading to a growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present the General Video-to-Music Generation model (GVMGen), designed to generate music closely related to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions, ensuring the preservation of pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, generative diversity, and application universality.
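The abstract describes hierarchical attention over spatial and temporal dimensions only at a high level. As a rough illustration of how a two-stage stack (per-frame spatial attention, then cross-frame temporal attention) could condition a music decoder, here is a minimal PyTorch sketch. The module name, dimensions, mean pooling, and projection into a music-decoder space are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class HierarchicalVideoAttention(nn.Module):
    """Illustrative sketch of hierarchical (spatial + temporal) attention over
    video features; shapes and design choices are assumptions, not GVMGen's code."""

    def __init__(self, video_dim=768, music_dim=512, num_heads=8):
        super().__init__()
        # Spatial attention relates patch tokens within each frame.
        self.spatial_attn = nn.MultiheadAttention(video_dim, num_heads, batch_first=True)
        # Temporal attention relates frame-level summaries across time.
        self.temporal_attn = nn.MultiheadAttention(video_dim, num_heads, batch_first=True)
        # Project aligned video features into the music decoder's embedding space.
        self.to_music = nn.Linear(video_dim, music_dim)

    def forward(self, video_feats):
        # video_feats: (batch, frames, patches, video_dim) from a frozen visual encoder
        b, t, p, d = video_feats.shape
        x = video_feats.reshape(b * t, p, d)
        # Spatial stage: attend over patches within each frame, then pool to a frame summary.
        x, _ = self.spatial_attn(x, x, x)
        frame_summary = x.mean(dim=1).reshape(b, t, d)
        # Temporal stage: attend over frames to capture motion and pacing cues.
        y, _ = self.temporal_attn(frame_summary, frame_summary, frame_summary)
        # Conditioning sequence for an autoregressive music token decoder.
        return self.to_music(y)  # (batch, frames, music_dim)
```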
Related papers
- UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into a token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z)
- VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos [32.741262543860934]
We present a framework for learning to generate background music from video inputs.
We develop a generative video-music Transformer with a novel semantic video-music alignment scheme.
A new temporal video encoder architecture allows us to efficiently process videos consisting of many densely sampled frames.
arXiv Detail & Related papers (2024-09-11T17:56:48Z)
- VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [71.01050359126141]
We propose VidMuse, a framework for generating music aligned with video inputs.
VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
arXiv Detail & Related papers (2024-06-06T17:58:11Z)
- Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can generate music matching a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns.
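The summary mentions "efficient token interleaving patterns" over several codebook streams; one widely described pattern of this kind is a delay scheme, where codebook k is shifted by k steps so a single-stage LM can predict all codebooks in one pass per step. Below is a minimal, hedged sketch of that idea; the function name, shapes, and padding convention are illustrative assumptions, not the MusicGen reference code.

```python
import torch

def delay_interleave(codes, pad_id=-1):
    """Sketch of a 'delay' token-interleaving pattern: codebook k is delayed by
    k positions, producing a grid a single transformer can model stepwise.
    Shapes and padding are illustrative assumptions."""
    # codes: (num_codebooks, seq_len) discrete tokens from a neural audio codec
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # delay codebook k by k positions
    return out  # (num_codebooks, seq_len + num_codebooks - 1)

# Example: 4 codebooks of 6 steps each become a 9-step interleaved grid.
codes = torch.arange(24).reshape(4, 6)
print(delay_interleave(codes))
```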
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- V2Meow: Meowing to the Visual Beat via Video-to-Music Generation [47.076283429992664]
V2Meow is a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types.
It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames.
arXiv Detail & Related papers (2023-05-11T06:26:41Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
- Video Background Music Generation: Dataset, Method and Evaluation [31.15901120245794]
We introduce a complete recipe including dataset, benchmark model, and evaluation metric for video background music generation.
We present SymMV, a video and symbolic music dataset with various musical annotations.
We also propose a benchmark video background music generation framework named V-MusProd.
arXiv Detail & Related papers (2022-11-21T08:39:48Z)
- Let's Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)