ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following
- URL: http://arxiv.org/abs/2509.23350v1
- Date: Sat, 27 Sep 2025 14:56:20 GMT
- Title: ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following
- Authors: Jiahao Zhao, Yunjia Li, Wei Li, Kazuyoshi Yoshii
- Abstract summary: We propose ABC-Eval, the first open-source benchmark dedicated to evaluating understanding and instruction-following capabilities on text-based ABC notation scores. It comprises 1,086 test samples spanning 10 sub-tasks, covering scenarios from basic musical syntax comprehension to complex sequence-level reasoning. We evaluate seven state-of-the-art LLMs on ABC-Eval, and the results reveal notable limitations in existing models' symbolic music processing capabilities.
- Score: 8.668922435342054
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As large language models continue to develop, the feasibility and significance of text-based symbolic music tasks have become increasingly prominent. While symbolic music has been widely used in generation tasks, LLM capabilities in understanding and reasoning about symbolic music remain largely underexplored. To address this gap, we propose ABC-Eval, the first open-source benchmark dedicated to evaluating understanding and instruction-following capabilities on text-based ABC notation scores. It comprises 1,086 test samples spanning 10 sub-tasks, covering scenarios from basic musical syntax comprehension to complex sequence-level reasoning. Such a diverse scope poses substantial challenges to models' ability to handle symbolic music tasks. We evaluated seven state-of-the-art LLMs on ABC-Eval, and the results reveal notable limitations in existing models' symbolic music processing capabilities. Furthermore, the consistent performance of individual baselines across different sub-tasks supports the reliability of our benchmark.
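For readers unfamiliar with ABC notation, the sketch below illustrates what a text-based score looks like and how an understanding-style benchmark item over it might be scored with exact match. The tune, the item schema, the `score_items` helper, and the toy header-reading baseline are illustrative assumptions, not the released ABC-Eval data or evaluation harness.

```python
# Hypothetical illustration only: the item schema, the tune, and the scoring
# rule are assumptions, not the released ABC-Eval data or harness.

ABC_TUNE = """X:1
T:Example Reel
M:4/4
L:1/8
K:G
GABc dedB | dedB dedB | c2ec B2dB | c2A2 A2BA |"""

# One possible shape for a syntax-comprehension sub-task item.
items = [
    {"abc": ABC_TUNE, "question": "What is the key signature of this tune?", "answer": "G major"},
    {"abc": ABC_TUNE, "question": "What is the time signature of this tune?", "answer": "4/4"},
]

def score_items(items, predict):
    """Exact-match accuracy; `predict` stands in for an LLM call."""
    correct = sum(
        predict(it["abc"], it["question"]).strip().lower() == it["answer"].lower()
        for it in items
    )
    return correct / len(items)

def header_baseline(abc, question):
    """Toy 'model' that reads the ABC header fields directly (assumes major keys)."""
    fields = dict(line.split(":", 1) for line in abc.splitlines() if ":" in line)
    if "time signature" in question.lower():
        return fields.get("M", "")
    if "key signature" in question.lower():
        return fields.get("K", "") + " major"
    return ""

print(score_items(items, header_baseline))  # 1.0 on this toy pair
```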
Related papers
- How Far Can Pretrained LLMs Go in Symbolic Music? Controlled Comparisons of Supervised and Preference-based Adaptation [15.849579727945153]
Music often shares notable parallels with language, motivating the use of pretrained large language models (LLMs) for symbolic music understanding and generation. We present a comparative study of finetuning strategies for ABC-based generation and understanding, comparing an off-the-shelf instruction-tuned backbone to domain-adapted variants. We highlight the tradeoff between domain adaptation and preserving prior information, as well as the distinct behaviour of metrics used to measure domain adaptation for symbolic music.
arXiv Detail & Related papers (2026-01-30T09:44:01Z) - SongSage: A Large Musical Language Model with Lyric Generative Pre-training [69.52790104805794]
SongSage is a large musical language model equipped with diverse lyric-centric intelligence through lyric generative pretraining. SongSage exhibits a strong understanding of lyric-centric knowledge, excels at rewriting user queries for zero-shot playlist recommendations, generates and continues lyrics effectively, and performs proficiently across seven additional capabilities.
arXiv Detail & Related papers (2026-01-03T10:54:37Z) - Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores [32.722200962820125]
We introduce the Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for evaluating score-level musical understanding. MSU-Bench comprises 1,800 generative question-answer (QA) pairs drawn from works spanning Bach, Beethoven, Chopin, Debussy, and others. We reveal sharp modality gaps, fragile level-wise success rates, and the difficulty of sustaining multilevel correctness.
arXiv Detail & Related papers (2025-11-24T06:40:38Z) - Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music [50.87225308217594]
This paper presents an unsupervised machine learning algorithm that identifies recurring patterns -- referred to as "music-words" -- from symbolic music data. We formulate the task of music-word discovery as a statistical optimization problem and propose a two-stage Expectation-Maximization (EM)-based learning framework.
arXiv Detail & Related papers (2025-09-29T11:10:57Z) - WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning [31.460197795186048]
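The two-stage EM framework mentioned above is only named in the abstract; the sketch below shows one generic way an EM-style unigram segmentation over a note sequence could be set up (candidate "words" as n-grams, forward-backward expected counts, re-estimated word probabilities). It is a minimal sketch of the general technique under those assumptions, not the paper's actual algorithm.

```python
# Minimal sketch of EM-style "word" discovery over a symbolic note sequence.
# This is a generic unigram-segmentation EM, not the algorithm from the paper:
# the candidate-word set, the forward-backward passes, and the toy data are assumptions.
from collections import defaultdict

def em_music_words(seq, max_len=4, iters=20):
    # Candidate words: all contiguous n-grams up to max_len.
    words = {tuple(seq[i:j]) for i in range(len(seq))
             for j in range(i + 1, min(i + max_len, len(seq)) + 1)}
    p = {w: 1.0 / len(words) for w in words}          # uniform initialisation

    for _ in range(iters):
        n = len(seq)
        # Forward probabilities over all segmentations of seq[:i].
        alpha = [0.0] * (n + 1); alpha[0] = 1.0
        for i in range(1, n + 1):
            for j in range(max(0, i - max_len), i):
                alpha[i] += alpha[j] * p.get(tuple(seq[j:i]), 0.0)
        # Backward probabilities over all segmentations of seq[j:].
        beta = [0.0] * (n + 1); beta[n] = 1.0
        for j in range(n - 1, -1, -1):
            for i in range(j + 1, min(n, j + max_len) + 1):
                beta[j] += p.get(tuple(seq[j:i]), 0.0) * beta[i]
        # E-step: expected usage count of each candidate word.
        counts = defaultdict(float)
        for j in range(n):
            for i in range(j + 1, min(n, j + max_len) + 1):
                w = tuple(seq[j:i])
                counts[w] += alpha[j] * p.get(w, 0.0) * beta[i] / alpha[n]
        # M-step: renormalise expected counts into new word probabilities.
        total = sum(counts.values())
        p = {w: c / total for w, c in counts.items()}
    return p

# Toy melody in which the motif C D E recurs; inspect which "words" gain probability mass.
notes = ["C", "D", "E", "G", "C", "D", "E", "A", "C", "D", "E"]
top = sorted(em_music_words(notes).items(), key=lambda kv: -kv[1])[:3]
print(top)
```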
We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions. We frame complex music reasoning as multiple-choice questions, enabling controlled and scalable assessment of MLLMs' symbolic music understanding.
arXiv Detail & Related papers (2025-09-05T01:54:50Z) - Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning [69.78158549955384]
We introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions. This approach generates verifiable sheet music questions in both textual and visual modalities. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music.
arXiv Detail & Related papers (2025-09-04T09:42:17Z) - Large Language Models' Internal Perception of Symbolic Music [3.9901365062418317]
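As a concrete illustration of treating a theory rule as a programmatic function, the sketch below computes the interval between two pitch classes and wraps the result into a verifiable multiple-choice item. The pitch encoding, interval names, and question template are assumptions made for illustration, not the SSMR-Bench generation code.

```python
# Illustrative sketch: a music-theory rule (interval between two pitch classes)
# expressed as a checkable function, then turned into a multiple-choice item.
# Encoding, names, and question template are assumptions, not SSMR-Bench code.
import random

PITCH_CLASSES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                 "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}
INTERVAL_NAMES = ["unison", "minor 2nd", "major 2nd", "minor 3rd", "major 3rd",
                  "perfect 4th", "tritone", "perfect 5th", "minor 6th",
                  "major 6th", "minor 7th", "major 7th"]

def interval_name(low, high):
    """Name of the ascending interval between two pitch classes (within an octave)."""
    semitones = (PITCH_CLASSES[high] - PITCH_CLASSES[low]) % 12
    return INTERVAL_NAMES[semitones]

def make_mcq(low, high, n_distractors=3, rng=random):
    """Build a verifiable multiple-choice question from the rule above."""
    answer = interval_name(low, high)
    distractors = rng.sample([n for n in INTERVAL_NAMES if n != answer], n_distractors)
    options = distractors + [answer]
    rng.shuffle(options)
    return {
        "question": f"What is the ascending interval from {low} to {high}?",
        "options": options,
        "answer": answer,
    }

print(make_mcq("C", "G"))   # correct option is "perfect 5th"
print(make_mcq("E", "C"))   # ascending E -> C is a "minor 6th"
```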
Large language models (LLMs) excel at modeling relationships between strings in natural language. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts.
arXiv Detail & Related papers (2025-07-17T05:48:45Z) - Semantic-Aware Interpretable Multimodal Music Auto-Tagging [1.8541450825478398]
We present an interpretable framework for music auto-tagging that leverages groups of musically meaningful multimodal features. Our method achieves competitive tagging performance while offering a deeper understanding of the decision-making process.
arXiv Detail & Related papers (2025-05-22T19:15:48Z) - MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z) - Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work sheds light on this process and deepens our understanding of the roles that different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z) - Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
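To make the two metrics named above concrete, the sketch below computes sentence-level BLEU with NLTK and a normalized edit-distance similarity over token sequences. The whitespace tokenization of the music text and the normalization by the longer sequence length are assumptions; the paper's exact evaluation scripts are not reproduced here.

```python
# Sketch of the two metrics named above: BLEU over token sequences (via NLTK)
# and a normalised edit-distance similarity. Tokenisation of the music text
# and the normalisation choice are assumptions, not the paper's exact setup.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def edit_distance(a, b):
    """Standard Levenshtein distance between two token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def edit_similarity(a, b):
    """1 minus edit distance normalised by the longer sequence length."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

reference = "X:1 M:4/4 K:C C D E F | G A B c |".split()
hypothesis = "X:1 M:4/4 K:C C D E G | G A B c |".split()

smooth = SmoothingFunction().method1
print("BLEU:", sentence_bleu([reference], hypothesis, smoothing_function=smooth))
print("Edit similarity:", edit_similarity(reference, hypothesis))
```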
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.