DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
- URL: http://arxiv.org/abs/2502.12623v1
- Date: Tue, 18 Feb 2025 08:09:42 GMT
- Title: DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
- Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji
- Abstract summary: DeepResonance is a multimodal music understanding model fine-tuned on multi-way aligned music, text, image, and video data.
We construct three 4-way datasets designed to enable DeepResonance to integrate both visual content and textual music features.
Our model achieves state-of-the-art performance across six music understanding tasks.
- Score: 20.072410431003142
- Abstract: Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which require the model to analyze and interpret various musical elements. These improvements have primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos, and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual content and textual music features. We also introduce multi-sampled ImageBind embeddings and a pre-alignment Transformer to enhance modality fusion prior to input into text LLMs, tailoring DeepResonance for multi-way instruction tuning. Our model achieves state-of-the-art performance across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We plan to open-source the models and the newly constructed datasets.
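The fusion step described in the abstract (multi-sampled ImageBind embeddings fed through a pre-alignment Transformer before the text LLM) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' released implementation: it assumes precomputed ImageBind embeddings, and all module names, dimensions, and hyperparameters below are assumptions.

```python
# Hedged sketch of the multi-way fusion idea: multi-sampled ImageBind embeddings
# -> pre-alignment Transformer -> projection into the text LLM's embedding space.
# Module names, dimensions, and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn


class PreAlignmentFusion(nn.Module):
    def __init__(self, imagebind_dim=1024, llm_dim=4096, num_layers=2, num_heads=8):
        super().__init__()
        # Learned type embeddings distinguish music / image / video samples.
        self.type_embed = nn.Embedding(3, imagebind_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=imagebind_dim, nhead=num_heads, batch_first=True
        )
        # Pre-alignment Transformer that fuses the modalities before the text LLM.
        self.pre_align = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Linear projection into the text LLM's token-embedding space.
        self.to_llm = nn.Linear(imagebind_dim, llm_dim)

    def forward(self, music_emb, image_emb, video_emb):
        # Each input: (batch, num_samples, imagebind_dim), i.e. several ImageBind
        # embeddings sampled per modality ("multi-sampled").
        parts, types = [music_emb, image_emb, video_emb], []
        for i, p in enumerate(parts):
            types.append(torch.full(p.shape[:2], i, dtype=torch.long, device=p.device))
        tokens = torch.cat(parts, dim=1) + self.type_embed(torch.cat(types, dim=1))
        fused = self.pre_align(tokens)   # cross-modality fusion
        return self.to_llm(fused)        # prefix tokens for the text LLM


if __name__ == "__main__":
    fusion = PreAlignmentFusion()
    music = torch.randn(2, 4, 1024)   # 4 ImageBind samples per music clip
    image = torch.randn(2, 1, 1024)
    video = torch.randn(2, 8, 1024)
    prefix = fusion(music, image, video)
    print(prefix.shape)  # torch.Size([2, 13, 4096])
```

In this sketch, the pre-alignment step simply lets the sampled music, image, and video embeddings attend to one another before being handed to the text LLM as prefix tokens; how DeepResonance actually parameterizes this module is specified in the paper, not here.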
Related papers
- Learning Musical Representations for Music Performance Question Answering [10.912207282129753]
Existing multimodal learning methods are incapable of dealing with fundamental problems within music performances.
Our primary backbone is designed to incorporate multimodal interactions within the context of music data.
Our experiments show state-of-the-art results on the Music AVQA datasets.
arXiv Detail & Related papers (2025-02-10T17:41:57Z) - Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation [21.359220595049788]
Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images.
Existing methods use a common embedding space for multimodal fusion.
This paper addresses these issues by using explicit bridges of text and music for multimodal alignment.
arXiv Detail & Related papers (2024-12-12T16:33:21Z) - A Survey of Foundation Models for Music Understanding [60.83532699497597]
This work is one of the early reviews of the intersection of AI techniques and music understanding.
We investigated, analyzed, and tested recent large-scale music foundation models with respect to their music comprehension abilities.
arXiv Detail & Related papers (2024-09-15T03:34:14Z) - Foundation Models for Music: A Survey [77.77088584651268]
Foundation models (FMs) have profoundly impacted diverse sectors, including music.
This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music.
arXiv Detail & Related papers (2024-08-26T15:13:14Z) - MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models [11.834712543531756]
MuChoMusic is a benchmark for evaluating music understanding in multimodal language models focused on audio.
It comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets.
We evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality.
arXiv Detail & Related papers (2024-08-02T15:34:05Z) - MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework, producing annotations across multiple modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z) - ComposerX: Multi-Agent Symbolic Music Composition with LLMs [51.68908082829048]
Music composition is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints.
Current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context Learning and Chain-of-Thought.
We propose ComposerX, an agent-based symbolic music generation framework.
arXiv Detail & Related papers (2024-04-28T06:17:42Z) - ChatMusician: Understanding and Generating Music Intrinsically with LLM [81.48629006702409]
ChatMusician is an open-source Large Language Model (LLM) that integrates intrinsic musical abilities.
It can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers.
Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc.
arXiv Detail & Related papers (2024-02-25T17:19:41Z) - Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can generate music to match a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z) - Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning [37.76488341368786]
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions.
We propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files.
We present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA dataset.
arXiv Detail & Related papers (2023-08-22T08:43:33Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)