WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
- URL: http://arxiv.org/abs/2509.04744v1
- Date: Fri, 05 Sep 2025 01:54:50 GMT
- Title: WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
- Authors: Gagan Mundada, Yash Vishe, Amit Namburi, Xin Xu, Zachary Novack, Julian McAuley, Junda Wu
- Abstract summary: We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions. We frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding.
- Score: 31.460197795186048
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
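The multiple-choice framing makes a very simple evaluation harness possible. Below is a minimal sketch of such a loop, assuming a hypothetical JSONL record with a score image, question, lettered options, answer key, and taxonomy category; the field names and the query_mllm stub are illustrative, not WildScore's released code.

```python
# Minimal sketch of a multiple-choice evaluation loop in the spirit of
# WildScore. Record fields and query_mllm() are illustrative placeholders,
# not the benchmark's released schema or API.
import json
from collections import defaultdict

def query_mllm(image_path: str, prompt: str) -> str:
    """Stand-in for a call to the MLLM under test; returns raw text."""
    raise NotImplementedError("wire up a model client here")

def evaluate(records):
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for rec in records:
        options = "\n".join(f"({k}) {v}" for k, v in rec["options"].items())
        prompt = (f"Question about the attached score: {rec['question']}\n"
                  f"{options}\nAnswer with a single option letter.")
        raw = query_mllm(rec["score_image"], prompt)
        predicted = raw.strip().lstrip("(").upper()[:1]  # crude letter pick
        stats = per_category[rec["category"]]
        stats[1] += 1
        stats[0] += int(predicted == rec["answer"])
    return {cat: correct / total for cat, (correct, total) in per_category.items()}

if __name__ == "__main__":
    with open("wildscore_sample.jsonl") as f:  # hypothetical file name
        records = [json.loads(line) for line in f]
    print(evaluate(records))  # per-category accuracy
```

Scoring per taxonomy category, as above, is what makes the controlled assessment possible: failures localize to specific musicological ontologies rather than a single aggregate number.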
Related papers
- BASS: Benchmarking Audio LMs for Musical Structure and Semantic Reasoning [74.84822135705025]
We introduce BASS, a benchmark designed to evaluate music understanding and reasoning in audio language models. BASS comprises 2658 questions spanning 12 tasks and 1993 unique songs, covering over 138 hours of music. We evaluate 14 open-source and frontier multimodal LMs, finding that even state-of-the-art models struggle on higher-level reasoning tasks.
arXiv Detail & Related papers (2026-02-03T23:40:31Z)
- ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following [8.668922435342054]
We propose ABC-Eval, the first open-source benchmark dedicated to LLMs' understanding and instruction-following capabilities on text-based ABC notation scores. It comprises 1,086 test samples spanning 10 sub-tasks, covering scenarios from basic musical syntax comprehension to complex sequence-level reasoning. We evaluate seven state-of-the-art LLMs on ABC-Eval, and the results reveal notable limitations in existing models' symbolic music processing capabilities.
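For readers unfamiliar with the format: ABC notation encodes a score as plain text, with single-letter header fields (e.g. M for meter, K for key) followed by the tune body, which is what makes scores directly consumable by text-only LLMs. The toy parser below is a minimal sketch of that structure, not ABC-Eval's tooling; the tune and the field handling are illustrative assumptions.

```python
# Toy illustration of text-based ABC notation: header fields followed by
# the tune body. This parser is a sketch, not ABC-Eval's tooling.
ABC_TUNE = """X:1
T:Example Scale
M:4/4
L:1/4
K:C
C D E F | G A B c |"""

def parse_abc(text: str) -> dict:
    """Split an ABC tune into its header fields and note body."""
    headers, body = {}, []
    for line in text.splitlines():
        if len(line) > 1 and line[1] == ":" and line[0].isalpha():
            headers[line[0]] = line[2:].strip()  # e.g. 'K' -> 'C'
        else:
            body.append(line)
    return {"headers": headers, "body": " ".join(body)}

tune = parse_abc(ABC_TUNE)
print(tune["headers"]["M"])  # meter: 4/4
print(tune["headers"]["K"])  # key: C
print(tune["body"])          # the note sequence a model must reason over
```

A question at the "basic musical syntax comprehension" end of the benchmark's range might ask for the meter of such a tune; sequence-level reasoning would instead require following the note body across bars.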
arXiv Detail & Related papers (2025-09-27T14:56:20Z)
- Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning [69.78158549955384]
We introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions. This approach generates verifiable sheet music questions in both textual and visual modalities. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music.
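The rules-as-functions idea can be made concrete: a rule such as interval naming is a deterministic function of two pitches, so any question generated from it carries a machine-verifiable answer. Below is a minimal sketch under that reading; the pitch table, the simplified semitone-based interval names, and the question template are illustrative assumptions, not the paper's generator.

```python
# Sketch of a music theory rule as a programmatic function that yields
# verifiable multiple-choice questions. The pitch encoding and question
# template are illustrative assumptions, not the paper's generator.
import random

SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
INTERVAL_NAMES = {0: "unison", 1: "minor 2nd", 2: "major 2nd",
                  3: "minor 3rd", 4: "major 3rd", 5: "perfect 4th",
                  6: "tritone", 7: "perfect 5th", 8: "minor 6th",
                  9: "major 6th", 10: "minor 7th", 11: "major 7th"}

def interval_name(note_a: str, note_b: str) -> str:
    """Rule as a function: name the interval from note_a up to note_b."""
    gap = (SEMITONES[note_b] - SEMITONES[note_a]) % 12
    return INTERVAL_NAMES[gap]

def make_question(rng: random.Random) -> dict:
    note_a, note_b = rng.sample(sorted(SEMITONES), 2)
    answer = interval_name(note_a, note_b)
    options = rng.sample([v for v in INTERVAL_NAMES.values() if v != answer], 3)
    options.append(answer)
    rng.shuffle(options)
    return {"question": f"What interval is formed from {note_a} up to {note_b}?",
            "options": options,
            "answer": answer}  # correct by construction, hence verifiable

print(make_question(random.Random(0)))
```

Because the answer is computed rather than hand-annotated, the same function can synthesize an unlimited supply of questions, which is what makes the generated items verifiable at scale.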
arXiv Detail & Related papers (2025-09-04T09:42:17Z)
- Large Language Models' Internal Perception of Symbolic Music [3.9901365062418317]
Large language models (LLMs) excel at modeling relationships between strings in natural language. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts.
arXiv Detail & Related papers (2025-07-17T05:48:45Z)
- MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models [45.2560094901105]
MusiXQA is the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. We develop Phi-3-MusiX, an MLLM fine-tuned on our dataset, achieving significant performance gains over GPT-based methods.
arXiv Detail & Related papers (2025-06-28T20:46:47Z)
- CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following [12.638115555721257]
CMI-Bench is a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-Audio, SALMONN, and MusiLingo.
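"Standardized evaluation metrics" here means scoring model outputs with the same functions used for classic MIR systems rather than with LLM-specific heuristics; the mir_eval library is the usual home of those functions. Below is a minimal sketch of that pattern for a key-detection instruction, assuming a naive convention for pulling the answer out of the model's free text; it is not CMI-Bench's toolkit code.

```python
# Sketch: score an audio-text LLM's free-text key estimate with a
# standard MIR metric (mir_eval's weighted key score). The answer
# extraction convention below is an assumption, not CMI-Bench's code.
import mir_eval

def extract_key(llm_output: str) -> str:
    """Naive convention: the answer is the last line, e.g. 'Db major'."""
    last = llm_output.strip().splitlines()[-1].strip().lower()
    tonic, _, mode = last.partition(" ")
    return f"{tonic.capitalize()} {mode or 'major'}"

reference = "Db major"
estimate = extract_key("Let me listen...\nThe key is:\ndb major")
# weighted_score gives partial credit for closely related keys (perfect
# fifth, relative, parallel), matching MIREX-style key evaluation.
print(mir_eval.key.weighted_score(reference, estimate))  # 1.0, exact match
```

Reusing metrics like this keeps LLM results directly comparable with the supervised MIR systems reported in prior work, which is the point of the benchmark's design choice.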
arXiv Detail & Related papers (2025-06-14T00:18:44Z)
- Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation [48.462734327375536]
Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. We propose M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs' holistic reasoning capacities.
arXiv Detail & Related papers (2025-06-02T04:00:35Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation [31.825105824490464]
Symbolic music, akin to language, can be encoded in discrete symbols. Recent research has extended the application of large language models (LLMs) to the symbolic music domain. This study conducts a thorough investigation of LLMs' capability and limitations in symbolic music processing.
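The "discrete symbols" claim is easy to see in miniature: a melody becomes a token sequence that an LLM can model exactly as it models text. The MIDI-style (pitch, duration) vocabulary below is a generic convention assumed for illustration, not the encoding used in the study.

```python
# Minimal illustration of symbolic music as discrete tokens: a melody as
# (pitch, duration-in-beats) events serialized into a text-like stream.
# The token vocabulary is a generic convention, assumed for illustration.
melody = [("C4", 1.0), ("E4", 1.0), ("G4", 1.0), ("C5", 2.0)]

def to_tokens(events):
    """Serialize (pitch, beats) pairs into a flat symbol sequence."""
    tokens = []
    for pitch, beats in events:
        tokens.extend((f"NOTE_{pitch}", f"DUR_{beats}"))
    return tokens

print(" ".join(to_tokens(melody)))
# -> NOTE_C4 DUR_1.0 NOTE_E4 DUR_1.0 NOTE_G4 DUR_1.0 NOTE_C5 DUR_2.0
```

Once serialized this way, next-token prediction over music is formally the same problem as language modeling over text, which is why symbolic formats are the natural entry point for LLMs into music.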
arXiv Detail & Related papers (2024-07-31T11:29:46Z)
- The Music Maestro or The Musically Challenged, A Massive Music Evaluation Benchmark for Large Language Models [63.53530525014976]
ZIQI-Eval is a benchmark specifically designed to evaluate the music-related capabilities of large language models (LLMs).
ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries.
Results indicate that all LLMs perform poorly on the ZIQI-Eval benchmark, suggesting significant room for improvement in their musical capabilities.
arXiv Detail & Related papers (2024-06-22T16:24:42Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. The InfiMM-Eval benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. We evaluate a selection of representative MLLMs on this rigorously developed benchmark of open-ended, multi-step reasoning tasks.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.