Sound and Music Biases in Deep Music Transcription Models: A Systematic Analysis
- URL: http://arxiv.org/abs/2512.14602v1
- Date: Tue, 16 Dec 2025 17:12:26 GMT
- Title: Sound and Music Biases in Deep Music Transcription Models: A Systematic Analysis
- Authors: Lukáš Samuel Marták, Patricia Hu, Gerhard Widmer
- Abstract summary: This work investigates the musical dimension -- specifically, variations in genre, dynamics, and polyphony levels. We introduce the MDS corpus, comprising three distinct subsets -- (1) Genre, (2) Random, and (3) MAEtest. We evaluate the performance of several state-of-the-art AMT systems on the MDS corpus using both traditional information-retrieval and musically informed performance metrics.
- Score: 6.87202900256721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Music Transcription (AMT) -- the task of converting music audio into note representations -- has seen rapid progress, driven largely by deep learning systems. Due to the limited availability of richly annotated music datasets, much of the progress in AMT has been concentrated on classical piano music, and even on a few very specific datasets. Whether these systems can generalize effectively to other musical contexts remains an open question. Complementing recent studies on distribution shifts in sound (e.g., recording conditions), in this work we investigate the musical dimension -- specifically, variations in genre, dynamics, and polyphony levels. To this end, we introduce the MDS corpus, comprising three distinct subsets -- (1) Genre, (2) Random, and (3) MAEtest -- to emulate different axes of distribution shift. We evaluate the performance of several state-of-the-art AMT systems on the MDS corpus using both traditional information-retrieval and musically informed performance metrics. Our extensive evaluation isolates and exposes varying degrees of performance degradation under specific distribution shifts. In particular, we measure a note-level F1 performance drop of 20 percentage points due to sound, and 14 due to genre. Generally, we find that dynamics estimation proves more vulnerable to musical variation than onset prediction. Musically informed evaluation metrics, particularly those capturing harmonic structure, help identify potential contributing factors. Furthermore, experiments with randomly generated, non-musical sequences reveal clear limitations in system performance under extreme musical distribution shifts. Altogether, these findings offer new evidence of the persistent impact of the Corpus Bias problem in deep AMT systems.
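The note-level F1 figures quoted above follow the standard transcription-evaluation convention: a predicted note counts as correct if its pitch matches a reference note and its onset falls within a small tolerance (commonly 50 ms). A minimal sketch of that matching in Python -- the greedy one-to-one pass, the tolerance value, and the note format are illustrative assumptions, not details taken from the paper:

```python
def note_f1(ref, est, onset_tol=0.05):
    """Note-level precision, recall, and F1.

    ref, est: lists of (onset_seconds, midi_pitch) tuples.
    An estimated note matches a reference note if the pitches are
    equal and the onsets differ by at most onset_tol seconds.
    Matching is greedy and one-to-one: each reference note can be
    claimed by at most one estimated note.
    """
    unmatched = list(ref)
    tp = 0
    for onset, pitch in est:
        for i, (r_onset, r_pitch) in enumerate(unmatched):
            if pitch == r_pitch and abs(onset - r_onset) <= onset_tol:
                del unmatched[i]  # consume the matched reference note
                tp += 1
                break
    precision = tp / len(est) if est else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

In practice such evaluations are typically run with an established toolkit (e.g. `mir_eval`'s transcription module), which also supports offset criteria and proper bipartite matching rather than the greedy pass shown here.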
Related papers
- BASS: Benchmarking Audio LMs for Musical Structure and Semantic Reasoning [74.84822135705025]
We introduce BASS, a benchmark designed to evaluate music understanding and reasoning in audio language models. BASS comprises 2658 questions spanning 12 tasks, covering 1993 unique songs and over 138 hours of music. We evaluate 14 open-source and frontier multimodal LMs, finding that even state-of-the-art models struggle on higher-level reasoning tasks.
arXiv Detail & Related papers (2026-02-03T23:40:31Z)
- A Study on the Data Distribution Gap in Music Emotion Recognition [7.281487567929003]
Music Emotion Recognition (MER) is a task deeply connected to human perception. Prior studies tend to focus on specific musical styles rather than incorporating a diverse range of genres. We address the task of recognizing emotion from audio content by investigating five datasets with dimensional emotion annotations.
arXiv Detail & Related papers (2025-10-06T10:57:05Z)
- High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling [65.02357548201188]
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information.
arXiv Detail & Related papers (2025-09-26T08:46:00Z)
- Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning [69.78158549955384]
We introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions. This approach generates verifiable sheet music questions in both textual and visual modalities. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music.
arXiv Detail & Related papers (2025-09-04T09:42:17Z)
- Progressive Rock Music Classification [0.0]
This study investigates the classification of progressive rock music, a genre characterized by complex compositions and diverse instrumentation. We extracted comprehensive audio features, including spectrograms, Mel-Frequency Cepstral Coefficients (MFCCs), chromagrams, and beat positions from song snippets. A winner-take-all voting strategy was employed to aggregate snippet-level predictions into final song classifications.
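The winner-take-all aggregation described above is simple majority voting over per-snippet predictions; a minimal sketch (the label names are illustrative):

```python
from collections import Counter

def winner_take_all(snippet_labels):
    """Aggregate per-snippet genre predictions into one song-level
    label by majority vote. Ties are broken in favor of the label
    that appeared first (Counter preserves insertion order and
    most_common's sort is stable)."""
    counts = Counter(snippet_labels)
    return counts.most_common(1)[0][0]
```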
arXiv Detail & Related papers (2025-04-15T02:48:52Z)
- Quantifying the Corpus Bias Problem in Automatic Music Transcription Systems [3.5570874721859016]
Automatic Music Transcription (AMT) is the task of recognizing notes in audio recordings of music.
We identify two primary sources of distribution shift: the music, and the sound.
We evaluate the performance of several SotA AMT systems on two new experimental test sets.
arXiv Detail & Related papers (2024-08-08T19:40:28Z)
- Towards Explainable and Interpretable Musical Difficulty Estimation: A Parameter-efficient Approach [49.2787113554916]
Estimating music piece difficulty is important for organizing educational music collections.
Our work employs explainable descriptors for difficulty estimation in symbolic music representations.
Our approach, evaluated on piano repertoire categorized into 9 classes, achieved 41.4% accuracy independently, with a mean squared error (MSE) of 1.7.
arXiv Detail & Related papers (2024-08-01T11:23:42Z)
- A Perceptual Measure for Evaluating the Resynthesis of Automatic Music Transcriptions [10.957528713294874]
This study focuses on the perception of music performances when contextual factors, such as room acoustics and instrument, change.
We propose to distinguish the concept of "performance" from that of "interpretation", which expresses the "artistic intention".
arXiv Detail & Related papers (2022-02-24T18:09:22Z)
- Towards Cross-Cultural Analysis using Music Information Dynamics [7.4517333921953215]
Music from different cultures establishes different aesthetics through different style conventions in two aspects.
We propose a framework that could be used to quantitatively compare music from different cultures by looking at these two aspects.
arXiv Detail & Related papers (2021-11-24T16:05:29Z)
- Sequence Generation using Deep Recurrent Networks and Embeddings: A study case in music [69.2737664640826]
This paper evaluates different types of memory mechanisms (memory cells) and analyses their performance in the field of music composition.
A set of quantitative metrics is presented to evaluate the performance of the proposed architecture automatically.
arXiv Detail & Related papers (2020-12-02T14:19:19Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.