LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection
- URL: http://arxiv.org/abs/2510.08580v1
- Date: Tue, 16 Sep 2025 02:15:06 GMT
- Title: LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection
- Authors: Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos, James C. Davis, George K. Thiruvathukal, Kristen Yeon-Ji Yun, Yung-Hsiang Lu
- Abstract summary: This paper introduces LadderSym, a novel Transformer-based method for music error detection. LadderSym is guided by two key observations about the state-of-the-art approaches. We evaluate our method on the MAESTRO-E and CocoChorales-E datasets by measuring the F1 score for each note category.
- Score: 6.949059287049708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Music learners can greatly benefit from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models. This paper introduces LadderSym, a novel Transformer-based method for music error detection. LadderSym is guided by two key observations about the state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) reliance on score audio introduces ambiguity in the frequency spectrum, degrading performance in music with concurrent notes. To address these limitations, LadderSym introduces (1) a two-stream encoder with inter-stream alignment modules to improve audio comparison capabilities and error detection F1 scores, and (2) a multimodal strategy that leverages both audio and symbolic scores by incorporating symbolic representations as decoder prompts, reducing ambiguity and improving F1 scores. We evaluate our method on the MAESTRO-E and CocoChorales-E datasets by measuring the F1 score for each note category. Compared to the previous state of the art, LadderSym more than doubles F1 for missed notes on MAESTRO-E (26.8% → 56.3%) and improves extra note detection by 14.4 points (72.0% → 86.4%). Similar gains are observed on CocoChorales-E. This work introduces general insights about comparison models that could inform sequence evaluation tasks for reinforcement learning, human skill assessment, and model evaluation.
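The gains quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: `f1_score` uses the standard F1 definition, `gains` is a hypothetical helper, and the counts in the final call are made-up toy numbers, not data from the paper. The listing does not state the note-matching criteria behind the reported F1 values, so only the reported percentages are taken from the source.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 = 2*TP / (2*TP + FP + FN); 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, ratio after/before), both rounded."""
    return round(after - before, 1), round(after / before, 2)


# F1 values (percent) reported in the abstract for MAESTRO-E:
# category -> (previous state of the art, LadderSym)
reported = {
    "missed notes": (26.8, 56.3),
    "extra notes": (72.0, 86.4),
}

for category, (before, after) in reported.items():
    points, ratio = gains(before, after)
    print(f"{category}: +{points} points, x{ratio} relative")

# Toy counts (hypothetical): 8 correctly flagged errors, 2 false alarms, 2 misses.
print(f"toy F1 = {f1_score(tp=8, fp=2, fn=2):.2f}")
```

The ratio for missed notes comes out slightly above 2, which matches the "more than doubles" wording, and the extra-note difference reproduces the stated 14.4 points.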
Related papers
- AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech [56.08149157180447]
We introduce AudioCapBench, a benchmark for evaluating audio captioning capabilities of large multimodal models. We evaluate 13 models across two providers (OpenAI, Google Gemini) using both reference-based metrics (METEOR, BLEU, ROUGE-L) and an LLM-as-Judge framework.
arXiv Detail & Related papers (2026-02-27T03:33:37Z)
- RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection [17.45655063331199]
RUMAA is a transformer-based framework for music performance analysis. It unifies score-to-performance alignment, score-informed transcription, and mistake detection in a near end-to-end manner.
arXiv Detail & Related papers (2025-07-16T12:13:13Z)
- Supervised contrastive learning from weakly-labeled audio segments for musical version matching [21.88094295569794]
We propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. With these two elements, we do not only achieve state-of-the-art results in the standard track-level evaluation, but we also obtain a breakthrough performance in a segment-level evaluation.
arXiv Detail & Related papers (2025-02-24T08:01:40Z)
- Detecting Music Performance Errors with Transformers [3.6837762419929168]
Existing tools for music error detection rely on automatic alignment. There is a lack of sufficient data to train music error detection models. We present a novel data generation technique capable of creating large-scale synthetic music error datasets.
arXiv Detail & Related papers (2025-01-03T07:04:20Z)
- Just Label the Repeats for In-The-Wild Audio-to-Score Alignment [7.7805314458791806]
We propose an efficient workflow for alignment of in-the-wild performance audio and corresponding sheet music scans (images).
We show that our proposed jump annotation workflow and improved feature representations together improve alignment accuracy by 150% relative to prior work.
arXiv Detail & Related papers (2024-11-11T23:05:02Z)
- Exploring Tokenization Methods for Multitrack Sheet Music Generation [48.8206920811097]
This study explores the tokenization of multitrack sheet music in ABC notation.
In terms of both computational efficiency and musicality, experimental results show that bar-stream patching performs best overall.
arXiv Detail & Related papers (2024-10-23T06:19:48Z)
- End-to-end Piano Performance-MIDI to Score Conversion with Transformers [26.900974153235456]
We present an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files.
We introduce a modern transformer-based architecture with a novel tokenized representation for symbolic music data.
Our method is also the first to directly predict notational details like trill marks or stem direction from performance data.
arXiv Detail & Related papers (2024-09-30T20:11:37Z)
- Toward a More Complete OMR Solution [49.74172035862698]
Optical music recognition aims to convert music notation into digital formats.
One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image.
We introduce a music object detector based on YOLOv8, which improves detection performance.
Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output.
arXiv Detail & Related papers (2024-08-31T01:09:12Z)
- Noisy Pair Corrector for Dense Retrieval [59.312376423104055]
We propose a novel approach called Noisy Pair Corrector (NPC).
NPC consists of a detection module and a correction module.
We conduct experiments on text-retrieval benchmarks Natural Question and TriviaQA, code-search benchmarks StaQC and SO-DS.
arXiv Detail & Related papers (2023-11-07T08:27:14Z)
- STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontrivial problem and poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z)
- Benchmarks and leaderboards for sound demixing tasks [44.99833362998488]
We introduce two new benchmarks for the sound source separation tasks.
We compare popular models for sound demixing, as well as their ensembles, on these benchmarks.
We also develop a novel approach for audio separation, based on the ensembling of different models that are suited best for the particular stem.
arXiv Detail & Related papers (2023-05-12T14:00:26Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- An Comparative Analysis of Different Pitch and Metrical Grid Encoding Methods in the Task of Sequential Music Generation [4.941630596191806]
This paper presents an analysis of the influence of pitch and meter on the performance of a token-based sequential music generation model.
For complexity, the single token approach and the multiple token approach are compared; for grid resolution, 0 (ablation), 1 (bar-level), 4 (downbeat-level), 12 (8th-triplet-level), and up to 64 (64th-note-grid-level), i.e. up to 16 subdivisions per beat, are compared.
Results suggest that the class-octave encoding significantly outperforms the taken-for-granted MIDI encoding on pitch-related metrics.
arXiv Detail & Related papers (2023-01-31T03:19:50Z)
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)