BASS: Benchmarking Audio LMs for Musical Structure and Semantic Reasoning
- URL: http://arxiv.org/abs/2602.04085v1
- Date: Tue, 03 Feb 2026 23:40:31 GMT
- Title: BASS: Benchmarking Audio LMs for Musical Structure and Semantic Reasoning
- Authors: Min Jang, Orevaoghene Ahia, Nazif Tamer, Sachin Kumar, Yulia Tsvetkov, Noah A. Smith
- Abstract summary: We introduce BASS, a benchmark designed to evaluate music understanding and reasoning in audio language models. BASS comprises 2658 questions spanning 12 tasks and 1993 unique songs, covering over 138 hours of music. We evaluate 14 open-source and frontier multimodal LMs, finding that even state-of-the-art models struggle on higher-level reasoning tasks.
- Score: 74.84822135705025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Music understanding is a complex task that often requires reasoning over both structural and semantic elements of audio. We introduce BASS, designed to evaluate music understanding and reasoning in audio language models across four broad categories: structural segmentation, lyric transcription, musicological analysis, and artist collaboration. BASS comprises 2658 questions spanning 12 tasks, 1993 unique songs and covering over 138 hours of music from a wide range of genres and tracks, crafted to assess musicological knowledge and reasoning in real-world scenarios. We evaluate 14 open-source and frontier multimodal LMs, finding that even state-of-the-art models struggle on higher-level reasoning tasks such as structural segmentation and artist collaboration, while performing best on lyric transcription. Our analysis reveals that current models leverage linguistic priors effectively but remain limited in reasoning over musical structure, vocal, and musicological attributes. BASS provides an evaluation framework with widespread applications in music recommendation and search and has the potential to guide the development of audio LMs.
Related papers
- Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores [32.722200962820125]
We introduce the Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for evaluating score-level musical understanding. MSU-Bench comprises 1,800 generative question-answer (QA) pairs drawn from works spanning Bach, Beethoven, Chopin, Debussy, and others. We reveal sharp modality gaps, fragile level-wise success rates, and the difficulty of sustaining multilevel correctness.
arXiv Detail & Related papers (2025-11-24T06:40:38Z) - Music Flamingo: Scaling Music Understanding in Audio Language Models [98.94537017112704]
Music Flamingo is a novel large audio-language model designed to advance music understanding in foundational audio models. MF-Skills is a dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards.
arXiv Detail & Related papers (2025-11-13T13:21:09Z) - Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music [50.87225308217594]
This paper presents an unsupervised machine learning algorithm that identifies recurring patterns, referred to as "music-words", from symbolic music data. We formulate the task of music-word discovery as a statistical optimization problem and propose a two-stage Expectation-Maximization (EM)-based learning framework.
arXiv Detail & Related papers (2025-09-29T11:10:57Z) - Advancing the Foundation Model for Music Understanding [9.210248657997687]
We introduce a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content. We also propose a new benchmark for multi-faceted music understanding called MuCUE.
arXiv Detail & Related papers (2025-08-02T03:33:47Z) - Learning Musical Representations for Music Performance Question Answering [10.912207282129753]
Existing multimodal learning methods are incapable of dealing with fundamental problems within music performances. Our primary backbone is designed to incorporate multimodal interactions within the context of music data. Our experiments show state-of-the-art results on the Music AVQA datasets.
arXiv Detail & Related papers (2025-02-10T17:41:57Z) - Evaluation of pretrained language models on music understanding [0.0]
We demonstrate that Large Language Models (LLMs) suffer from 1) prompt sensitivity, 2) inability to model negation, and 3) sensitivity to the presence of specific words.
We quantified these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology.
Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
arXiv Detail & Related papers (2024-09-17T14:44:49Z) - A Survey of Foundation Models for Music Understanding [60.83532699497597]
This work is one of the early reviews of the intersection of AI techniques and music understanding.
We investigated, analyzed, and tested recent large-scale music foundation models with respect to their music comprehension abilities.
arXiv Detail & Related papers (2024-09-15T03:34:14Z) - MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models [11.834712543531756]
MuChoMusic is a benchmark for evaluating music understanding in multimodal language models focused on audio.
It comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets.
We evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality.
arXiv Detail & Related papers (2024-08-02T15:34:05Z) - MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z) - Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks Artist Identification, Music Genre Classification and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.