CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following
- URL: http://arxiv.org/abs/2506.12285v2
- Date: Fri, 27 Jun 2025 22:42:09 GMT
- Title: CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following
- Authors: Yinghao Ma, Siyou Li, Juntao Yu, Emmanouil Benetos, Akira Maezawa
- Abstract summary: CMI-Bench is a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-audio, SALMONN, and MusiLingo.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multiple-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional MIR annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking, reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-audio, SALMONN, and MusiLingo. Experimental results reveal significant performance gaps between LLMs and supervised models, along with cultural, chronological, and gender biases, highlighting both the potential and the limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.
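To make the recipe concrete, the sketch below shows one way a traditional MIR annotation (here, a key-detection label) could be recast as an instruction-following item and then scored with a standardized MIR metric. This is a minimal illustration under assumptions, not the actual CMI-Bench toolkit: the item schema, prompt wording, and parse_key helper are invented for the example; only the mir_eval library and its weighted key-detection score are real.

```python
# Hypothetical sketch -- not the official CMI-Bench code.
import re
import mir_eval  # real library; its key metric matches supervised MIR evaluation

def make_key_detection_item(audio_path: str, reference_key: str) -> dict:
    """Recast a traditional key-detection annotation as an instruction-following item."""
    return {
        "audio": audio_path,  # path to the music recording
        "instruction": "Listen to the recording and name its musical key, "
                       "e.g. 'C major' or 'A minor'.",
        "reference": reference_key,  # the original MIR ground-truth label
    }

def parse_key(model_output: str) -> str:
    """Naive free-text parser: pull the first '<tonic> major|minor' span from the reply."""
    match = re.search(r"\b([A-G][#b]?) (major|minor)\b", model_output)
    return f"{match.group(1)} {match.group(2)}" if match else "C major"  # arbitrary fallback

item = make_key_detection_item("track_001.wav", "A minor")
llm_answer = "The piece sounds like it is in A minor."  # stand-in for a model's reply

# mir_eval.key.weighted_score gives 1.0 for an exact match and partial
# credit for closely related keys (fifths, relative, parallel).
score = mir_eval.key.weighted_score(item["reference"], parse_key(llm_answer))
print(f"weighted key score: {score}")  # -> 1.0 here
```

The same pattern extends to the other tasks, e.g. beat tracking scored with mir_eval.beat and melody extraction with mir_eval.melody, which is what keeps LLM scores directly comparable with supervised MIR systems.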
Related papers
- Advancing the Foundation Model for Music Understanding
We introduce a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content. We also propose a new benchmark for multi-faceted music understanding called MuCUE.
arXiv Detail & Related papers (2025-08-02T03:33:47Z)
- CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining
This paper presents a novel cross-modal contrastive learning framework to guide music similarity modeling. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach. Experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks.
arXiv Detail & Related papers (2025-03-29T15:43:09Z)
- Exploring GPT's Ability as a Judge in Music Understanding
We use a systematic prompt engineering approach for large language models to solve music information retrieval problems. We convert the music data to symbolic inputs and evaluate LLMs' ability to detect annotation errors in three key MIR tasks. A concept augmentation method is proposed to evaluate LLMs' music reasoning consistency with the music concepts provided in the prompts.
arXiv Detail & Related papers (2025-01-22T22:49:27Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement
Large language models (LLMs) generate content that can undermine trust in online discourse. Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-LLM collaboration. To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z)
- The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language Models
ZIQI-Eval is a benchmark specifically designed to evaluate the music-related capabilities of large language models (LLMs).
ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries.
Results indicate that all LLMs perform poorly on the ZIQI-Eval benchmark, suggesting significant room for improvement in their musical capabilities.
arXiv Detail & Related papers (2024-06-22T16:24:42Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of the representations from all open-source pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)