From Discord to Harmony: Decomposed Consonance-based Training for Improved Audio Chord Estimation
- URL: http://arxiv.org/abs/2509.01588v1
- Date: Mon, 01 Sep 2025 16:20:47 GMT
- Title: From Discord to Harmony: Decomposed Consonance-based Training for Improved Audio Chord Estimation
- Authors: Andrea Poltronieri, Xavier Serra, Martín Rocamora
- Abstract summary: This paper presents an evaluation of inter-annotator agreement in chord annotations, using metrics that extend beyond traditional binary measures. We introduce a novel ACE conformer-based model that integrates consonance concepts into the model through consonance-based label smoothing.
- Score: 9.584152437544974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio Chord Estimation (ACE) holds a pivotal role in music information research, having garnered attention for over two decades due to its relevance for music transcription and analysis. Despite notable advancements, challenges persist in the task, particularly concerning unique characteristics of harmonic content, which have resulted in existing systems' performances reaching a glass ceiling. These challenges include annotator subjectivity, where varying interpretations among annotators lead to inconsistencies, and class imbalance within chord datasets, where certain chord classes are over-represented compared to others, posing difficulties in model training and evaluation. As a first contribution, this paper presents an evaluation of inter-annotator agreement in chord annotations, using metrics that extend beyond traditional binary measures. In addition, we propose a consonance-informed distance metric that reflects the perceptual similarity between harmonic annotations. Our analysis suggests that consonance-based distance metrics more effectively capture musically meaningful agreement between annotations. Expanding on these findings, we introduce a novel ACE conformer-based model that integrates consonance concepts into the model through consonance-based label smoothing. The proposed model also addresses class imbalance by separately estimating root, bass, and all note activations, enabling the reconstruction of chord labels from decomposed outputs.
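The consonance-based label smoothing described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the distance matrix, function name, and temperature parameter are assumptions. The idea is that instead of spreading smoothing mass uniformly over all chord classes, mass is allocated according to a perceptual (consonance-based) distance, so chords similar to the annotated one receive more probability.

```python
import numpy as np

def consonance_smoothed_targets(true_idx, distance, temperature=1.0):
    """Turn a hard chord label into a soft target distribution.

    Unlike uniform label smoothing, probability mass is spread according
    to a consonance-based distance: perceptually similar chords receive
    more mass than dissonant ones.

    true_idx    : index of the annotated chord class
    distance    : (C, C) matrix of consonance-based distances,
                  with distance[i, i] == 0
    temperature : controls how sharply mass concentrates on the label
    """
    d = distance[true_idx]                   # distances from the true chord
    logits = -d / temperature                # closer chords -> higher logit
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    return weights / weights.sum()

# Toy example with 4 hypothetical chord classes; class 2 is
# perceptually close to class 0, class 3 is dissonant with it.
D = np.array([
    [0.0, 2.0, 0.5, 3.0],
    [2.0, 0.0, 2.5, 1.0],
    [0.5, 2.5, 0.0, 3.0],
    [3.0, 1.0, 3.0, 0.0],
])
targets = consonance_smoothed_targets(true_idx=0, distance=D, temperature=0.5)
```

The resulting `targets` vector sums to one, peaks at the annotated class, and assigns the consonant neighbour (class 2) more mass than the dissonant classes; such soft targets can then be used with a standard cross-entropy loss.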
Related papers
- Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization [2.087792589220897]
We introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps. We systematically evaluate this approach against prior curricula across multiple experimental axes. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics.
arXiv Detail & Related papers (2026-01-22T17:46:31Z) - Dual-granularity Sinkhorn Distillation for Enhanced Learning from Long-tailed Noisy Data [67.25796812343454]
Real-world datasets for deep learning frequently suffer from the co-occurring challenges of class imbalance and label noise. We propose Dual-granularity Sinkhorn Distillation (D-SINK), a novel framework that enhances dual robustness by distilling and integrating complementary insights. Experiments on benchmark datasets demonstrate that D-SINK significantly improves robustness and achieves strong empirical performance in learning from long-tailed noisy data.
arXiv Detail & Related papers (2025-10-09T13:05:27Z) - Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z) - Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation [51.67328507400985]
In noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. This paper focuses on learning from partial labels annotated by pre-trained vision-language models. It proposes an innovative collaborative consistency regularization (Co-Reg) method.
arXiv Detail & Related papers (2025-06-03T12:48:54Z) - Why Perturbing Symbolic Music is Necessary: Fitting the Distribution of Never-used Notes through a Joint Probabilistic Diffusion Model [6.085444830169205]
Existing music generation models are mostly language-based, neglecting the frequency continuity property of notes.
We introduce the Music-Diff architecture, which fits a joint distribution of notes and semantic information to generate symbolic music conditionally.
arXiv Detail & Related papers (2024-08-04T07:38:38Z) - COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations [17.218899140175697]
COCOLA is a contrastive learning method for musical audio representations that captures the harmonic and rhythmic coherence between samples. Our method operates at the level of the stems composing music tracks and can input features obtained via Harmonic-Percussive Separation (HPS).
arXiv Detail & Related papers (2024-04-25T18:42:25Z) - Serenade: A Model for Human-in-the-loop Automatic Chord Estimation [1.6385815610837167]
We evaluate our model on a dataset of popular music and show that, with this human-in-the-loop approach, harmonic analysis performance improves over a model-only approach.
arXiv Detail & Related papers (2023-10-17T11:31:29Z) - MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z) - What You Hear Is What You See: Audio Quality Metrics From Image Quality Metrics [44.659718609385315]
We investigate the feasibility of utilizing state-of-the-art image perceptual metrics for evaluating audio signals by representing them as spectrograms.
We customise one of the metrics which has a psychoacoustically plausible architecture to account for the peculiarities of sound signals.
We evaluate the effectiveness of our proposed metric and several baseline metrics using a music dataset.
arXiv Detail & Related papers (2023-05-19T10:43:57Z) - SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance [88.0355290619761]
This work focuses on the separation of unknown musical instruments.
We propose the Separation-with-Consistency (SeCo) framework, which can accomplish the separation on unknown categories.
Our framework exhibits strong adaptation ability on the novel musical categories and outperforms the baseline methods by a significant margin.
arXiv Detail & Related papers (2022-03-25T09:42:11Z) - A Perceptual Measure for Evaluating the Resynthesis of Automatic Music Transcriptions [10.957528713294874]
This study focuses on the perception of music performances when contextual factors, such as room acoustics and instrument, change.
We propose to distinguish the concept of "performance" from the one of "interpretation", which expresses the "artistic intention"
arXiv Detail & Related papers (2022-02-24T18:09:22Z) - Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.