Related papers: VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

URL: http://arxiv.org/abs/2406.04321v2
Date: Sun, 13 Oct 2024 17:59:22 GMT
Title: VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Authors: Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
Abstract summary: We propose VidMuse, a framework for generating music aligned with video inputs. VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
Score: 71.01050359126141
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 360K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets will be available at https://github.com/ZeyueT/VidMuse/.

Related papers

Let Your Video Listen to Your Music! [62.27731415767459]
We propose a novel framework, MVAA, that automatically edits video to align with the rhythm of a given music track. We modularize the task into a two-step process: aligning motion with audio beats, followed by rhythm-aware video editing. This hybrid approach enables adaptation within 10 minutes on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone.
arXiv Detail & Related papers (2025-06-23T17:52:16Z)
Audio-Sync Video Generation with Multi-Stream Temporal Control [64.00019697525322]
We introduce MTV, a versatile framework for video generation with precise audio-visual synchronization. MTV separates audio into speech, effects, and tracks, enabling control over lip motion, event timing, and visual mood. To support the framework, we additionally present DEmix, a dataset of high-quality cinematic videos and demixed audio tracks.
arXiv Detail & Related papers (2025-06-09T17:59:42Z)
MusicInfuser: Making Video Diffusion Listen and Dance [20.41612388764672]
MusicInfuser is an approach for generating high-quality dance videos synchronized to a specified music track. We show how existing video diffusion models can be adapted to align with musical inputs.
arXiv Detail & Related papers (2025-03-18T17:59:58Z)
GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions [13.9134271174972]
We present the General Video-to-Music Generation model (GVMGen) for generating music that is highly related to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions. Our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios.
arXiv Detail & Related papers (2025-01-17T06:30:11Z)
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos [32.741262543860934]
We present a framework for learning to generate background music from video inputs. We develop a generative video-music Transformer with a novel semantic video-music alignment scheme. A new temporal video encoder architecture allows us to efficiently process videos consisting of many densely sampled frames.
arXiv Detail & Related papers (2024-09-11T17:56:48Z)
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions. We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos. Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks. Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can match a provided video. In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z)
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation [47.076283429992664]
V2Meow is a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames.
arXiv Detail & Related papers (2023-05-11T06:26:41Z)
Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos. To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process. The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
Video Background Music Generation: Dataset, Method and Evaluation [31.15901120245794]
We introduce a complete recipe including dataset, benchmark model, and evaluation metric for video background music generation. We present SymMV, a video and symbolic music dataset with various musical annotations. We also propose a benchmark video background music generation framework named V-MusProd.
arXiv Detail & Related papers (2022-11-21T08:39:48Z)
Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG). APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.