Self-Supervised Beat Tracking in Musical Signals with Polyphonic
Contrastive Learning
- URL: http://arxiv.org/abs/2201.01771v2
- Date: Sun, 16 Jul 2023 01:12:36 GMT
- Authors: Dorian Desblancs
- Abstract summary: We present a new self-supervised learning pretext task for beat tracking and downbeat estimation.
It makes use of Spleeter, an audio source separation model, to separate a song's drums from the rest of its signal.
It is notably one of the first works to use audio source separation as a fundamental component of self-supervision.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Annotating musical beats is a very long and tedious process. In order to
combat this problem, we present a new self-supervised learning pretext task for
beat tracking and downbeat estimation. This task makes use of Spleeter, an
audio source separation model, to separate a song's drums from the rest of its
signal. The separated drum signals are used as positives, and by extension
negatives, for contrastive pre-training. The drum-less signals, on the
other hand, are used as anchors. When pre-training a fully-convolutional and
recurrent model using this pretext task, an onset function is learned. In some
cases, this function is found to align with periodic elements in a song. We
find that pre-trained models outperform randomly initialized models when a beat
tracking training set is extremely small (less than 10 examples). When this is
not the case, pre-training leads to a learning speed-up that causes the model
to overfit to the training set. More generally, this work defines new
perspectives in the realm of musical self-supervised learning. It is notably
one of the first works to use audio source separation as a fundamental
component of self-supervision.
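The contrastive setup described above can be illustrated with a minimal numpy sketch of an InfoNCE-style loss, assuming embeddings have already been computed by some encoder from drum-less anchor excerpts and their matching drum-stem positives (the paper uses Spleeter for the separation; the function name and batch construction here are hypothetical, not the authors' implementation).

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE contrastive loss.

    anchors:   (B, D) embeddings of drum-less excerpts.
    positives: (B, D) embeddings of the matching drum stems; row i of
               `positives` is the positive for row i of `anchors`, and
               every other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries correspond to the true anchor-positive pairs.
    return -np.mean(np.diag(log_probs))
```

When anchor and positive embeddings of the same song agree more than those of different songs, this loss drops toward zero, which is the signal that drives the pre-training.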
Related papers
- MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z)
- An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging [6.363158395541767]
Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data.
In this study, we investigate and compare the performance of new self-supervised methods for music tagging.
arXiv Detail & Related papers (2024-04-14T07:56:08Z)
- Refining Pre-Trained Motion Models [56.18044168821188]
We take on the challenge of improving state-of-the-art supervised models with self-supervised training.
We focus on obtaining a "clean" training signal from real-world unlabelled video.
We show that our method yields reliable gains over fully-supervised methods in real videos.
arXiv Detail & Related papers (2024-01-01T18:59:33Z)
- Comparision Of Adversarial And Non-Adversarial LSTM Music Generative Models [2.569647910019739]
This work implements and compares adversarial and non-adversarial training of recurrent neural network music composers on MIDI data.
The evaluation indicates that adversarial training produces more aesthetically pleasing music.
arXiv Detail & Related papers (2022-11-01T20:23:49Z)
- Spectrograms Are Sequences of Patches [5.253100011321437]
We design a self-supervised model that captures a spectrogram of music as a series of patches: Patchifier.
We do not use labeled data for the pre-training process, only a subset of the MTAT dataset containing 16k music clips.
Our model achieves competitive results compared to other audio representation models.
arXiv Detail & Related papers (2022-10-28T08:39:36Z)
- Large-Scale Pre-training for Person Re-identification with Noisy Labels [125.49696935852634]
We develop a large-scale Pre-training framework utilizing Noisy Labels (PNL).
In principle, joint learning of these three modules not only clusters similar examples to one prototype, but also rectifies noisy labels based on the prototype assignment.
This simple pre-training task provides a scalable way to learn SOTA Re-ID representations from scratch on "LUPerson-NL" without bells and whistles.
arXiv Detail & Related papers (2022-03-30T17:59:58Z)
- Overcoming Catastrophic Forgetting beyond Continual Learning: Balanced Training for Neural Machine Translation [15.309573393914462]
Neural networks tend to forget the previously learned knowledge when learning multiple tasks sequentially from dynamic data distributions.
This problem is called catastrophic forgetting, which is a fundamental challenge in the continual learning of neural networks.
We propose Complementary Online Knowledge Distillation (COKD), which uses dynamically updated teacher models trained on specific data orders to iteratively provide complementary knowledge to the student model.
arXiv Detail & Related papers (2022-03-08T08:08:45Z)
- Catch-A-Waveform: Learning to Generate Audio from a Single Short Example [33.96833901121411]
We present a GAN-based generative model that can be trained on one short audio signal from any domain.
We show that in all cases, no more than 20 seconds of training audio commonly suffice for our model to achieve state-of-the-art results.
arXiv Detail & Related papers (2021-06-11T14:35:11Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning [55.854205371307884]
We formalize the music-conditioned dance generation as a sequence-to-sequence learning problem.
We propose a novel curriculum learning strategy to alleviate error accumulation of autoregressive models in long motion sequence generation.
Our approach significantly outperforms existing state-of-the-art methods on automatic metrics and human evaluation.
arXiv Detail & Related papers (2020-06-11T00:08:25Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)