Tempo estimation as fully self-supervised binary classification
- URL: http://arxiv.org/abs/2401.08891v1
- Date: Wed, 17 Jan 2024 00:15:16 GMT
- Title: Tempo estimation as fully self-supervised binary classification
- Authors: Florian Henkel, Jaehun Kim, Matthew C. McCallum, Samuel E. Sandberg, Matthew E. P. Davies
- Abstract summary: We propose a fully self-supervised approach that does not rely on any human labeled data.
Our method builds on the fact that generic (music) audio embeddings already encode a variety of properties, including information about tempo.
- Score: 6.255143207183722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of global tempo estimation in musical audio.
Given that annotating tempo is time-consuming and requires certain musical
expertise, few publicly available data sources exist to train machine learning
models for this task. Towards alleviating this issue, we propose a fully
self-supervised approach that does not rely on any human labeled data. Our
method builds on the fact that generic (music) audio embeddings already encode
a variety of properties, including information about tempo, making them easily
adaptable for downstream tasks. While recent work in self-supervised tempo
estimation aimed to learn a tempo specific representation that was subsequently
used to train a supervised classifier, we reformulate the task into the binary
classification problem of predicting whether a target track has the same or a
different tempo compared to a reference. While the former still requires
labeled training data for the final classification model, our approach uses
arbitrary unlabeled music data in combination with time-stretching for model
training as well as a small set of synthetically created reference samples for
predicting the final tempo. Evaluation of our approach in comparison with the
state-of-the-art reveals highly competitive performance when the constraint of
finding the precise tempo octave is relaxed.
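To make the recipe concrete, below is a minimal sketch of both stages: building same/different-tempo training pairs from unlabeled audio via time-stretching, and estimating tempo by comparing the target against synthetic references at known BPMs. This is an illustration rather than the authors' implementation: the `embed` stand-in, `PairClassifier`, the stretch range, and the BPM grid are all assumptions.
```python
import numpy as np
import librosa
import torch
import torch.nn as nn

SR = 22050

def embed(y: np.ndarray) -> torch.Tensor:
    """Hypothetical stand-in for a generic pretrained music audio
    embedding: a time-averaged log-mel vector."""
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=64)
    return torch.from_numpy(np.log1p(mel).mean(axis=1)).float()

class PairClassifier(nn.Module):
    """Binary head: does the target share the reference's tempo?"""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, target: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([target, ref], dim=-1))  # same-tempo logit

def make_training_pair(y: np.ndarray, rng: np.random.Generator):
    """Self-supervised pair from one unlabeled track: stretching by rate r
    scales the tempo by r, so r == 1 yields a 'same tempo' pair and
    r != 1 a 'different tempo' pair -- no human labels involved."""
    if rng.random() < 0.5:
        return embed(y), embed(y), torch.tensor([1.0])   # same tempo
    r = float(rng.uniform(0.7, 1.4))                     # assumed stretch range
    return (embed(librosa.effects.time_stretch(y, rate=r)),
            embed(y), torch.tensor([0.0]))               # different tempo

def click_reference(bpm: float, seconds: float = 10.0) -> np.ndarray:
    """Synthetic reference sample with a known tempo: a plain click track."""
    times = np.arange(0.0, seconds, 60.0 / bpm)
    return librosa.clicks(times=times, sr=SR, length=int(seconds * SR))

def estimate_bpm(y: np.ndarray, model: PairClassifier,
                 bpm_grid: np.ndarray = np.arange(60, 200)) -> float:
    """Report the tempo whose synthetic reference the model is most
    confident matches the target."""
    target = embed(y)
    scores = [model(target, embed(click_reference(float(b)))).item()
              for b in bpm_grid]
    return float(bpm_grid[int(np.argmax(scores))])
```
Note that in such a scheme a 90 BPM target can also plausibly match a 180 BPM reference; the abstract's remark about relaxing the constraint of finding the precise tempo octave refers to exactly this kind of ambiguity.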
Related papers
- Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation [3.8570045844185237]
We present Stem-JEPA, a novel Joint-Embedding Predictive Architecture (JEPA) trained on a multi-track dataset.
Our model comprises two networks: an encoder and a predictor, which are jointly trained to predict the embeddings of compatible stems.
We evaluate our model's performance on a retrieval task on the MUSDB18 dataset, testing its ability to find the missing stem from a mix.
arXiv Detail & Related papers (2024-08-05T14:34:40Z)
- One-bit Supervision for Image Classification: Problem, Solution, and Beyond [114.95815360508395]
This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification.
We propose a multi-stage training paradigm and incorporate negative label suppression into an off-the-shelf semi-supervised learning algorithm.
In multiple benchmarks, the learning efficiency of the proposed approach surpasses that of full-bit, semi-supervised supervision.
arXiv Detail & Related papers (2023-11-26T07:39:00Z)
- Tempo vs. Pitch: understanding self-supervised tempo estimation [0.783970968131292]
Self-supervision methods learn representations by solving pretext tasks that do not require human-generated labels.
We study the relationship between the input representation and data distribution for self-supervised tempo estimation.
arXiv Detail & Related papers (2023-04-14T00:08:08Z)
- Informative regularization for a multi-layer perceptron RR Lyrae classifier under data shift [3.303002683812084]
We propose a scalable and easily adaptable approach based on an informative regularization and an ad-hoc training procedure to mitigate the shift problem.
Our method provides a new path to incorporate knowledge from characteristic features into artificial neural networks to manage the underlying data shift problem.
arXiv Detail & Related papers (2023-03-12T02:49:19Z)
- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
- Building for Tomorrow: Assessing the Temporal Persistence of Text Classifiers [18.367109894193486]
Performance of text classification models can drop over time as the data to be classified becomes more temporally distant from the training data.
This raises important research questions on the design of text classification models intended to persist over time.
We perform longitudinal classification experiments on three datasets spanning between 6 and 19 years.
arXiv Detail & Related papers (2022-05-11T12:21:14Z)
- Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space.
We evaluate our method on datasets evaluating both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) noise.
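A minimal sketch of what such a neighbor-consistency regularizer can look like, given per-example embeddings and classifier logits; the exact loss, `k`, and `temperature` here are illustrative assumptions, not the paper's formulation:
```python
import torch
import torch.nn.functional as F

def neighbor_consistency_loss(feats: torch.Tensor, logits: torch.Tensor,
                              k: int = 10, temperature: float = 0.1) -> torch.Tensor:
    """feats: (N, D) embeddings; logits: (N, C) classifier outputs.
    Pulls each example's prediction toward a similarity-weighted
    average of its k nearest feature-space neighbors' predictions."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()                      # cosine similarities (N, N)
    sim.fill_diagonal_(float("-inf"))            # exclude self-matches
    topk, idx = sim.topk(k, dim=1)
    w = F.softmax(topk / temperature, dim=1)     # neighbor weights (N, k)
    log_p = F.log_softmax(logits, dim=1)
    neigh = (w.unsqueeze(-1) * log_p.exp()[idx]).sum(dim=1)  # (N, C)
    return F.kl_div(log_p, neigh, reduction="batchmean")
```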
arXiv Detail & Related papers (2022-02-04T15:46:27Z)
- Self-supervised Pretraining with Classification Labels for Temporal Activity Detection [54.366236719520565]
Temporal Activity Detection aims to predict activity classes per frame.
Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited.
This work proposes a novel self-supervised pretraining method for detection leveraging classification labels.
arXiv Detail & Related papers (2021-11-26T18:59:28Z)
- Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning [54.90704746573636]
Our method does not require manually selecting key frames, and produces state-of-the-art results with as little as 2% of annotated frames.
We experimentally validate that our method outperforms existing methods when working with as little as 2% of randomly chosen data.
arXiv Detail & Related papers (2020-11-03T17:35:57Z)
- Counting Out Time: Class Agnostic Video Repetition Counting in the Wild [82.26003709476848]
We present an approach for estimating the period with which an action is repeated in a video.
The crux of the approach lies in constraining the period prediction module to use temporal self-similarity.
We train this model, called RepNet, with a synthetic dataset that is generated from a large unlabeled video collection.
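The self-similarity constraint is simple to picture; here is a minimal numpy sketch (an illustration assuming per-frame embeddings, not the RepNet code). A period predictor operating only on this matrix is forced to rely on repetition structure rather than appearance:
```python
import numpy as np

def temporal_self_similarity(frames: np.ndarray) -> np.ndarray:
    """frames: (T, D) per-frame embeddings. Returns a (T, T) matrix of
    row-softmaxed negative pairwise squared distances; a repeated action
    shows up as periodic stripes along the diagonals."""
    d2 = ((frames[:, None, :] - frames[None, :, :]) ** 2).sum(axis=-1)
    s = -d2
    s = s - s.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)
```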
arXiv Detail & Related papers (2020-06-27T18:00:42Z)
- Conditional Mutual Information-based Contrastive Loss for Financial Time Series Forecasting [12.0855096102517]
We present a representation learning framework for financial time series forecasting.
In this paper, we propose to first learn compact representations from time series data, then use the learned representations to train a simpler model for predicting time series movements.
arXiv Detail & Related papers (2020-02-18T15:24:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.