Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
- URL: http://arxiv.org/abs/2412.09428v1
- Date: Thu, 12 Dec 2024 16:33:21 GMT
- Title: Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
- Authors: Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Wu Chengjing, Xuecheng Nie, Jiao Dai, Jizhong Han, Yue Liao, Si Liu
- Abstract summary: Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images.
Existing methods use a common embedding space for multimodal fusion.
This paper addresses these issues by using explicit bridges of text and music for multimodal alignment.
- Score: 21.359220595049788
- Abstract: Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, their application in multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework that generates music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks. The results demonstrate that VMB significantly enhances music quality, modality alignment, and customization alignment compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation with applications in various multimedia fields. Demos and code are available at https://github.com/wbs2788/VMB.
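Read as a pipeline, the abstract describes two explicit bridges feeding a single generator: visuals are first converted into a textual description (text bridge), reference tracks are then retrieved from that description and any user query (music bridge), and the generator is conditioned on both. The sketch below is a minimal, hypothetical outline of that flow; all class and function names are illustrative placeholders, not the authors' released API.

```python
# Hypothetical sketch of the two-bridge pipeline described in the abstract.
# Every name here is an illustrative placeholder, not the VMB implementation.
from dataclasses import dataclass, field


@dataclass
class Bridges:
    text_bridge: str                                        # music-oriented description of the visual input
    music_bridge: list[str] = field(default_factory=list)   # retrieved reference tracks


def describe_visuals(visual_input: str) -> str:
    """Stand-in for the Multimodal Music Description Model (visuals -> text)."""
    return f"uplifting orchestral piece matching the scene: {visual_input}"


def retrieve_music(description: str, user_query: str | None, library: dict[str, str]) -> list[str]:
    """Stand-in for Dual-track Music Retrieval: broad retrieval over the description,
    plus targeted retrieval when the user pins down an attribute (genre, mood, ...)."""
    def match(query: str) -> list[str]:
        terms = set(query.lower().split())
        return [track for track, tags in library.items() if terms & set(tags.split())]

    broad = match(description)
    targeted = match(user_query) if user_query else []
    return targeted + [t for t in broad if t not in targeted]


def generate_music(bridges: Bridges) -> str:
    """Stand-in for Explicitly Conditioned Music Generation, conditioned on both bridges."""
    refs = ", ".join(bridges.music_bridge) or "none"
    return f"<audio conditioned on '{bridges.text_bridge}' with references: {refs}>"


if __name__ == "__main__":
    library = {"track_a.wav": "uplifting orchestral", "track_b.wav": "dark ambient"}
    text = describe_visuals("sunrise over mountains")
    music_refs = retrieve_music(text, user_query="orchestral", library=library)
    print(generate_music(Bridges(text, music_refs)))
```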
Related papers
- DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning [20.072410431003142]
DeepResonance is a multimodal music understanding model fine-tuned for multi-way aligned music, text, image, and video data.
We construct datasets designed to enable DeepResonance to integrate both visual and textual music feature content.
Our model achieves state-of-the-art performance across six music understanding tasks.
arXiv Detail & Related papers (2025-02-18T08:09:42Z) - Learning Musical Representations for Music Performance Question Answering [10.912207282129753]
Existing multimodal learning methods cannot address fundamental problems specific to music performances.
Our primary backbone is designed to incorporate multimodal interactions within the context of music data.
Our experiments show state-of-the-art results on the Music AVQA datasets.
arXiv Detail & Related papers (2025-02-10T17:41:57Z) - UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z) - VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [71.01050359126141]
We propose VidMuse, a framework for generating music aligned with video inputs.
VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
arXiv Detail & Related papers (2024-06-06T17:58:11Z) - Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models [9.311353871322325]
Mozart's Touch is a framework capable of generating music aligned with cross-modal inputs such as images, videos, and text.
Unlike traditional end-to-end methods, Mozart's Touch uses LLMs to accurately interpret visual elements without requiring the training or fine-tuning of music generation models.
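The training-free recipe summarized above, in which an LLM interprets the visual input and an off-the-shelf text-to-music model renders audio, can be sketched as follows. The function names and prompt are hypothetical stand-ins, not the paper's code.

```python
# Hypothetical sketch of a training-free caption-then-generate pipeline in the
# spirit of Mozart's Touch. Both stages are mocked out; no real model is called.
def caption_with_llm(image_description: str) -> str:
    """Stand-in for an LLM that turns visual content into a music prompt."""
    return f"calm piano piece evoking: {image_description}"

def text_to_music(prompt: str) -> bytes:
    """Stand-in for a frozen, pre-trained text-to-music model (no fine-tuning)."""
    return f"<waveform for '{prompt}'>".encode()

def image_to_music(image_description: str) -> bytes:
    # Neither stage is trained: the LLM interprets the visuals, the music model
    # only ever sees text, so the two are bridged by the prompt alone.
    return text_to_music(caption_with_llm(image_description))

print(image_to_music("a rainy street at night"))
```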
arXiv Detail & Related papers (2024-05-05T03:15:52Z) - MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z) - Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can generate music matching a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns.
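The "token interleaving patterns" refer to how several parallel codebook streams of the compressed audio representation are flattened so a single autoregressive transformer can model them. Below is a minimal toy sketch of the delay-style interleaving idea, where codebook k is shifted by k frames; it is illustrative only, not MusicGen's actual code.

```python
# Toy sketch of delay-style codebook interleaving: with K parallel codebooks per
# audio frame, stream k is shifted k steps so one LM predicts all streams per step.
PAD = -1  # placeholder token for positions introduced by the shift

def delay_interleave(codes: list[list[int]]) -> list[list[int]]:
    """codes[k][t] is the token of codebook k at frame t."""
    K, T = len(codes), len(codes[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]   # stream k is delayed by k frames
    return out

# 3 codebooks, 4 frames of toy codes
codes = [[10, 11, 12, 13],
         [20, 21, 22, 23],
         [30, 31, 32, 33]]
for row in delay_interleave(codes):
    print(row)
# [10, 11, 12, 13, -1, -1]
# [-1, 20, 21, 22, 23, -1]
# [-1, -1, 30, 31, 32, 33]
```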
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z) - MusIAC: An extensible generative framework for Music Infilling Applications with multi-level Control [11.811562596386253]
Infilling refers to the task of generating musical sections given the surrounding multi-track music.
The proposed framework is extensible to new control tokens, such as tonal tension per bar and track polyphony level.
We present the model in a Google Colab notebook to enable interactive generation.
arXiv Detail & Related papers (2022-02-11T10:02:21Z) - PopMAG: Pop Music Accompaniment Generation [190.09996798215738]
We propose a novel MUlti-track MIDI representation (MuMIDI) which enables simultaneous multi-track generation in a single sequence.
MuMIDI enlarges the sequence length and brings the new challenge of long-term music modeling.
We call our system for pop music accompaniment generation PopMAG.
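A toy illustration of the single-sequence idea may help: notes from all tracks are merged into one ordered token stream with explicit track tokens, which is why the sequence grows and long-term modeling becomes the main challenge. The token layout below is illustrative, not MuMIDI's exact vocabulary.

```python
# Toy, hypothetical illustration of representing multi-track music as one sequence.
from collections import namedtuple

Note = namedtuple("Note", "track bar position pitch")

notes = [
    Note("melody", bar=0, position=0, pitch=72),
    Note("bass",   bar=0, position=0, pitch=36),
    Note("melody", bar=0, position=4, pitch=74),
    Note("drums",  bar=0, position=4, pitch=38),
]

def to_single_sequence(notes):
    """Merge all tracks into one ordered stream; a new bar/position token pair is
    emitted only when the position changes, then each note carries its track token."""
    tokens, current = [], None
    for n in sorted(notes, key=lambda n: (n.bar, n.position, n.track)):
        if (n.bar, n.position) != current:
            current = (n.bar, n.position)
            tokens += [f"<bar:{n.bar}>", f"<pos:{n.position}>"]
        tokens += [f"<track:{n.track}>", f"<pitch:{n.pitch}>"]
    return tokens

print(to_single_sequence(notes))
```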
arXiv Detail & Related papers (2020-08-18T02:28:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.