Supervised contrastive learning from weakly-labeled audio segments for musical version matching
- URL: http://arxiv.org/abs/2502.16936v2
- Date: Fri, 04 Apr 2025 06:20:46 GMT
- Title: Supervised contrastive learning from weakly-labeled audio segments for musical version matching
- Authors: Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji
- Abstract summary: We propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. With these two elements, we not only achieve state-of-the-art results in the standard track-level evaluation, but also obtain breakthrough performance in a segment-level evaluation.
- Score: 21.88094295569794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the nature of the available ground truth, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require matching them at the segment level (e.g., 20s chunks). In addition, existing approaches resort to classification and triplet losses, disregarding more recent losses that could bring meaningful improvements. In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we not only achieve state-of-the-art results in the standard track-level evaluation, but also obtain breakthrough performance in a segment-level evaluation. We believe that, due to the generality of the challenges addressed here, the proposed methods may find utility in domains beyond audio or musical version matching.
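For a concrete picture, below is a minimal, hypothetical sketch of how the two ingredients described in the abstract could fit together: a decoupled, temperature-scaled contrastive loss over normalized segment embeddings, combined with a best-pair reduction so that, under weak (track-level) version labels, only the closest positive segment pair drives each anchor's loss. The function name, tensor layout, and the specific choices of reduction and decoupling are assumptions for illustration, not the authors' exact formulation.

```python
# Hypothetical sketch (not the authors' code): a decoupled, temperature-scaled
# contrastive loss over L2-normalized segment embeddings, where positives are
# segments whose tracks are versions of the same piece. For each anchor, only
# its best-matching positive segment is kept, mimicking a pairwise segment
# distance reduction under weak (track-level) labels.
import torch
import torch.nn.functional as F


def segment_contrastive_loss(emb: torch.Tensor,
                             track_ids: torch.Tensor,
                             clique_ids: torch.Tensor,
                             tau: float = 0.1) -> torch.Tensor:
    """emb: (N, D) segment embeddings; track_ids: (N,) source track per segment;
    clique_ids: (N,) version-group (track-level) label per segment."""
    z = F.normalize(emb, dim=-1)
    sim = z @ z.t() / tau                        # (N, N) temperature-scaled cosine similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)

    # Positives: same version group but different source track (weak, track-level label);
    # negatives: different version group.
    pos = (clique_ids[:, None] == clique_ids[None, :]) & (track_ids[:, None] != track_ids[None, :])
    neg = clique_ids[:, None] != clique_ids[None, :]

    # Best-pair reduction: keep only the most similar positive segment per anchor.
    best_pos = sim.masked_fill(~pos, float('-inf')).max(dim=1).values    # (N,)

    # Decoupled denominator: sum over negatives only (the positive term is excluded).
    neg_term = torch.logsumexp(sim.masked_fill(~neg | eye, float('-inf')), dim=1)

    has_both = pos.any(dim=1) & neg.any(dim=1)   # skip anchors lacking positives or negatives
    return (neg_term - best_pos)[has_both].mean()
```

At track level, one would then aggregate segment-pair distances per track pair (e.g., by taking the minimum), which is consistent with the best-pair reduction used during training in this sketch.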
Related papers
- Partitioned Saliency Ranking with Dense Pyramid Transformers [4.449304130658638]
Saliency ranking has emerged as a challenging task focusing on assessing the degree of saliency at the instance level.
Previous approaches perform saliency ranking by directly sorting the rank scores of salient instances, which does not explicitly resolve the inherent ambiguities.
We propose the ranking by partition paradigm, which segments unordered salient instances into partitions and then ranks them based on the correlations among these partitions.
arXiv Detail & Related papers (2023-08-01T02:33:10Z) - Learning Context-aware Classifier for Semantic Segmentation [88.88198210948426]
In this paper, contextual hints are exploited via learning a context-aware classifier.
Our method is model-agnostic and can be easily applied to generic segmentation models.
With only negligible additional parameters and +2% inference time, a decent performance gain is achieved on both small and large models.
arXiv Detail & Related papers (2023-03-21T07:00:35Z) - Generalized Few-Shot Semantic Segmentation: All You Need is Fine-Tuning [35.51193811629467]
Generalized few-shot semantic segmentation was introduced to move beyond only evaluating few-shot segmentation models on novel classes.
While all current approaches are based on meta-learning, they perform poorly and their learning saturates after observing only a few shots.
We propose the first fine-tuning solution, and demonstrate that it addresses the saturation problem while achieving state-of-the-art results on two datasets.
arXiv Detail & Related papers (2021-12-21T04:44:57Z) - Unsupervised Learning of Deep Features for Music Segmentation [8.528384027684192]
Music segmentation is the problem of identifying boundaries between, and labeling, distinct music segments.
The performance of a range of music segmentation algorithms has been dependent on the audio features chosen to represent the audio.
In this work, unsupervised training of deep feature embeddings using convolutional neural networks (CNNs) is explored for music segmentation.
arXiv Detail & Related papers (2021-08-30T01:55:44Z) - Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
We propose a Prototype-centered Attentive Learning (PAL) model composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates an attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
arXiv Detail & Related papers (2021-01-20T11:48:12Z) - A Weakly-Supervised Semantic Segmentation Approach based on the Centroid Loss: Application to Quality Control and Inspection [6.101839518775968]
We propose and assess a new weakly-supervised semantic segmentation approach making use of a novel loss function.
The performance of the approach is evaluated against datasets from two different industry-related case studies.
arXiv Detail & Related papers (2020-10-26T09:08:21Z) - Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection [69.2370349274216]
Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances.
Semantic components are distilled from utterances via multi-head self-attention.
Our method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances.
arXiv Detail & Related papers (2020-10-06T05:16:38Z) - Multi-scale Interactive Network for Salient Object Detection [91.43066633305662]
We propose aggregate interaction modules to integrate the features from adjacent levels.
To obtain more efficient multi-scale features, self-interaction modules are embedded in each decoder unit.
Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-17T15:41:37Z) - Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z) - Equalization Loss for Long-Tailed Object Recognition [109.91045951333835]
State-of-the-art object detection methods still perform poorly on large vocabulary and long-tailed datasets.
We propose a simple but effective loss, named equalization loss, to tackle the problem of long-tailed rare categories.
Our method achieves AP gains of 4.1% and 4.8% for the rare and common categories on the challenging LVIS benchmark (a rough sketch of the core idea appears after this list).
arXiv Detail & Related papers (2020-03-11T09:14:53Z)
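Regarding the equalization loss entry above: the following is a rough sketch of the core idea as I read it (suppressing the discouraging negative-class term for rare categories in a per-class sigmoid cross-entropy), under stated assumptions. The function name, the frequency threshold, and the exact weighting scheme are illustrative choices, not the paper's precise formulation.

```python
# Rough sketch of an equalization-loss-style reweighting (my reading of the
# general idea, not the paper's exact formulation): with a per-class sigmoid
# cross-entropy, suppress the negative (not-this-class) term for rare ("tail")
# classes so abundant samples of other classes do not overwhelm them.
import torch
import torch.nn.functional as F


def equalization_style_loss(logits: torch.Tensor,
                            targets: torch.Tensor,
                            class_freq: torch.Tensor,
                            tail_thresh: float = 1e-3) -> torch.Tensor:
    """logits: (B, C) class scores; targets: (B,) integer labels;
    class_freq: (C,) relative training frequency of each class."""
    B, C = logits.shape
    y = F.one_hot(targets, C).float()             # (B, C) one-vs-rest targets
    is_tail = (class_freq < tail_thresh).float()  # (C,) 1 for rare classes

    # Weight is zero only when class j is rare AND sample i does not belong to it,
    # i.e. the discouraging negative term for tail classes is ignored.
    w = 1.0 - is_tail[None, :] * (1.0 - y)        # (B, C)

    bce = F.binary_cross_entropy_with_logits(logits, y, reduction='none')
    return (w * bce).sum() / B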