MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
- URL: http://arxiv.org/abs/2408.01337v1
- Date: Fri, 2 Aug 2024 15:34:05 GMT
- Title: MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
- Authors: Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, Dmitry Bogdanov
- Abstract summary: MuChoMusic is a benchmark for evaluating music understanding in multimodal language models focused on audio.
It comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets.
We evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfaces. However, their evaluation poses considerable challenges, and it remains unclear how to effectively assess their ability to correctly interpret music-related inputs with current methods. Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal language models focused on audio. MuChoMusic comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Through the holistic analysis afforded by the benchmark, we evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality, pointing to a need for better multimodal integration. Data and code are open-sourced.
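The evaluation protocol described above lends itself to a simple scoring harness. Below is a minimal Python sketch of how such a multiple-choice benchmark can be scored: each question is formatted with lettered options, the model's reply is parsed for a letter, and overall accuracy is computed. The `ask_model` stub, prompt format, and example items are illustrative assumptions, not MuChoMusic's actual data schema or code.

```python
import random
import re

def ask_model(prompt: str) -> str:
    # Toy stand-in for a real audio-language model call; in practice this
    # would pass the music track plus the prompt to the model under test.
    return random.choice("ABCD")  # random-guess baseline

def format_prompt(question: str, options: list[str]) -> str:
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def parse_choice(reply: str):
    # Extract the first standalone letter A-D from a free-form reply.
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else None

# Hypothetical items: each has a question, four options, and a gold letter.
items = [
    {"question": "Which instrument carries the main melody?",
     "options": ["Violin", "Trumpet", "Piano", "Flute"], "answer": "C"},
    {"question": "What is the most likely genre of this track?",
     "options": ["Jazz", "Techno", "Folk", "Opera"], "answer": "A"},
]

correct = 0
for item in items:
    reply = ask_model(format_prompt(item["question"], item["options"]))
    if parse_choice(reply) == item["answer"]:
        correct += 1

print(f"Accuracy: {correct / len(items):.2%}")
```

Parsing a single letter from a free-form reply is itself a design choice: benchmarks of this kind typically constrain the prompt so that replies can be mapped unambiguously to one option.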
Related papers
- CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models [51.03510073676228]
CLaMP 2 is a music information retrieval system compatible with 101 languages.
By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale.
CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities.
arXiv Detail & Related papers (2024-10-17T06:43:54Z)
- A Survey of Foundation Models for Music Understanding [60.83532699497597]
This work is one of the early reviews of the intersection of AI techniques and music understanding.
We investigated, analyzed, and tested recent large-scale music foundation models with respect to their music comprehension abilities.
arXiv Detail & Related papers (2024-09-15T03:34:14Z)
- Foundation Models for Music: A Survey [77.77088584651268]
Foundation models (FMs) have profoundly impacted diverse sectors, including music.
This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music.
arXiv Detail & Related papers (2024-08-26T15:13:14Z)
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework, producing annotations across modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained training of large multimodal language models.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music [21.380568107727207]
We present MuChin, the first open-source music description benchmark in Chinese colloquial language.
MuChin is designed to evaluate the performance of multimodal Large Language Models in understanding and describing music.
All data related to the benchmark, along with the scoring code and detailed appendices, have been open-sourced.
arXiv Detail & Related papers (2024-02-15T10:55:01Z)
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [98.34889301515412]
We develop the Qwen-Audio model and address the limited coverage of prior audio-language models by scaling up audio-language pre-training to cover over 30 tasks and various audio types.
Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning.
We further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogues and supporting a variety of audio-centric scenarios.
arXiv Detail & Related papers (2023-11-14T05:34:50Z)
- LLark: A Multimodal Instruction-Following Language Model for Music [7.7033394687966865]
Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand.
We present LLark, an instruction-tuned multimodal model for music understanding.
arXiv Detail & Related papers (2023-10-11T03:12:47Z)
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
arXiv Detail & Related papers (2023-04-25T17:05:38Z)
- MuLan: A Joint Embedding of Music Audio and Natural Language [15.753767984842014]
This paper presents a new generation of models that link music audio directly to natural language descriptions.
MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings.
arXiv Detail & Related papers (2022-08-26T03:13:21Z)
- Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning.
Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences (see the contrastive-loss sketch after this list).
arXiv Detail & Related papers (2022-08-25T16:55:15Z)
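MuLan and MusCALL above both follow the same dual-encoder (two-tower) recipe: an audio encoder and a text encoder are trained so that embeddings of matching audio-caption pairs are pulled together while mismatched pairs are pushed apart. Below is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE-style) objective commonly used for this, with randomly initialized stand-in encoders; the dimensions, module names, and temperature value are illustrative assumptions, not taken from either paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in encoders: a real system would use an audio network (e.g. over
# mel-spectrograms) and a text network (e.g. a transformer) instead.
audio_encoder = torch.nn.Linear(128, 64)  # audio features -> joint space
text_encoder = torch.nn.Linear(300, 64)   # text features  -> joint space

def contrastive_loss(audio_feats, text_feats, temperature=0.07):
    # Project both modalities into the shared space and L2-normalize.
    a = F.normalize(audio_encoder(audio_feats), dim=-1)
    t = F.normalize(text_encoder(text_feats), dim=-1)
    # Cosine-similarity logits for every audio-text pair in the batch.
    logits = a @ t.T / temperature
    # Matching pairs sit on the diagonal; treat retrieval in both
    # directions as classification and average the two losses.
    targets = torch.arange(len(a))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

batch_audio = torch.randn(8, 128)  # 8 audio clips (dummy features)
batch_text = torch.randn(8, 300)   # their paired captions (dummy features)
print(contrastive_loss(batch_audio, batch_text).item())
```

Once trained this way, either tower can be used alone: text-to-music retrieval ranks audio embeddings by similarity to a query caption, and zero-shot tagging scores an audio clip against embedded label prompts.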
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.