Toward Fully Self-Supervised Multi-Pitch Estimation
- URL: http://arxiv.org/abs/2402.15569v1
- Date: Fri, 23 Feb 2024 19:12:41 GMT
- Title: Toward Fully Self-Supervised Multi-Pitch Estimation
- Authors: Frank Cwitkowitz and Zhiyao Duan
- Abstract summary: We present a suite of self-supervised learning objectives for multi-pitch estimation.
These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly.
Our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.
- Score: 21.000057864087164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-pitch estimation is a decades-long research problem involving the
detection of pitch activity associated with concurrent musical events within
multi-instrument mixtures. Supervised learning techniques have demonstrated
solid performance on more narrow characterizations of the task, but suffer from
limitations concerning the shortage of large-scale and diverse polyphonic music
datasets with multi-pitch annotations. We present a suite of self-supervised
learning objectives for multi-pitch estimation, which encourage the
concentration of support around harmonics, invariance to timbral
transformations, and equivariance to geometric transformations. These
objectives are sufficient to train an entirely convolutional autoencoder to
produce multi-pitch salience-grams directly, without any fine-tuning. Despite
training exclusively on a collection of synthetic single-note audio samples,
our fully self-supervised framework generalizes to polyphonic music mixtures,
and achieves performance comparable to supervised models trained on
conventional multi-pitch datasets.
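The invariance and equivariance objectives named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the input is a time-frequency array with log-spaced frequency bins, so that a pitch shift becomes a shift along the frequency axis. It also assumes mean squared error as the comparison, and the paper's actual loss may differ. All names (`toy_model`, `shift_bins`, `invariance_loss`, `equivariance_loss`) are hypothetical, and `toy_model` is only a stand-in for the convolutional autoencoder.

```python
import numpy as np


def shift_bins(x, k):
    """Shift a (freq_bins, frames) array by k bins along frequency, zero-padding."""
    out = np.zeros_like(x)
    if k > 0:
        out[k:] = x[:-k]
    elif k < 0:
        out[:k] = x[-k:]
    else:
        out[:] = x
    return out


def mse(a, b):
    """Mean squared error between two arrays (assumed comparison, see lead-in)."""
    return float(np.mean((a - b) ** 2))


def toy_model(spec):
    """Hypothetical stand-in for the autoencoder: scale each frame by its peak."""
    peak = spec.max(axis=0, keepdims=True)
    return spec / np.maximum(peak, 1e-9)


def invariance_loss(model, spec, gain):
    """Timbre-style transform (per-bin gain) should leave the output unchanged."""
    return mse(model(spec * gain[:, None]), model(spec))


def equivariance_loss(model, spec, k):
    """Shifting the input by k bins should shift the output by k bins."""
    return mse(model(shift_bins(spec, k)), shift_bins(model(spec), k))


# Two steady partials in a 16-bin, 4-frame toy input.
spec = np.zeros((16, 4))
spec[5, :] = 1.0
spec[10, :] = 0.5

flat_gain = np.full(16, 2.0)            # uniform gain
tilted_gain = np.linspace(0.5, 1.5, 16)  # spectral tilt

print(equivariance_loss(toy_model, spec, 2))
print(invariance_loss(toy_model, spec, flat_gain))
print(invariance_loss(toy_model, spec, tilted_gain))
```

In this toy setting the stand-in model is equivariant to the shift and invariant to a uniform gain, but not to a spectral tilt, so the last loss is nonzero. A training objective of this kind would penalize exactly that discrepancy.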
Related papers
- MuSiCNet: A Gradual Coarse-to-Fine Framework for Irregularly Sampled Multivariate Time Series Analysis [45.34420094525063]
We introduce a novel perspective that irregularity is essentially relative in some senses.
MuSiCNet is an ISMTS analysis framework that is consistently competitive with SOTA in three mainstream tasks.
arXiv Detail & Related papers (2024-12-02T02:50:01Z)
- LC-Protonets: Multi-Label Few-Shot Learning for World Music Audio Tagging [65.72891334156706]
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification.
LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items.
Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music.
arXiv Detail & Related papers (2024-09-17T15:13:07Z)
- Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles [95.49699178874683]
We propose DiffDiv, an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs).
We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features.
We show that DPM-guided diversification is sufficient to remove dependence on shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z)
- Compatible Transformer for Irregularly Sampled Multivariate Time Series [75.79309862085303]
We propose a transformer-based encoder to achieve comprehensive temporal-interaction feature learning for each individual sample.
We conduct extensive experiments on 3 real-world datasets and validate that the proposed CoFormer significantly and consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-17T06:29:09Z)
- Co-Learning Meets Stitch-Up for Noisy Multi-label Visual Recognition [70.00984078351927]
This paper focuses on reducing noise based on some inherent properties of multi-label classification and long-tailed learning under noisy cases.
We propose a Stitch-Up augmentation to synthesize a cleaner sample, which directly reduces multi-label noise.
A Heterogeneous Co-Learning framework is further designed to leverage the inconsistency between long-tailed and balanced distributions.
arXiv Detail & Related papers (2023-07-03T09:20:28Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- A Unifying Perspective on Multi-Calibration: Game Dynamics for Multi-Objective Learning [63.20009081099896]
We provide a unifying framework for the design and analysis of multicalibrated predictors.
We exploit connections to game dynamics to achieve state-of-the-art guarantees for a diverse set of multicalibration learning problems.
arXiv Detail & Related papers (2023-02-21T18:24:17Z)
- Self-supervision and Learnable STRFs for Age, Emotion, and Country Prediction [26.860736835176617]
This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio.
We evaluate the complementarity between the tasks posed by examining independent task-specific and joint models, and explore the relative strengths of different feature sets.
We find that robust data preprocessing in conjunction with score fusion over spectro-temporal receptive field and HuBERT models achieved our best ExVo-MultiTask test score of 0.412.
arXiv Detail & Related papers (2022-06-25T06:09:10Z)
- Deep-Learning Architectures for Multi-Pitch Estimation: Towards Reliable Evaluation [7.599399338954308]
Multi-pitch estimation aims to detect the simultaneous activity of pitches in polyphonic music recordings.
In this paper, we realize different architectures based on CNNs, the U-net structure, and self-attention components.
We compare variants of these architectures in different sizes for multi-pitch estimation using the MusicNet and Schubert Winterreise datasets.
arXiv Detail & Related papers (2022-02-18T13:52:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.