Supervised and Unsupervised Learning of Audio Representations for Music Understanding
- URL: http://arxiv.org/abs/2210.03799v1
- Date: Fri, 7 Oct 2022 20:07:35 GMT
- Title: Supervised and Unsupervised Learning of Audio Representations for Music Understanding
- Authors: Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, Fabien Gouyon, Andreas F. Ehmann
- Abstract summary: We show how the domain of pre-training datasets affects the adequacy of the resulting audio embeddings for downstream tasks.
We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance.
- Score: 9.239657838690226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we provide a broad comparative analysis of strategies for
pre-training audio understanding models for several tasks in the music domain,
including labelling of genre, era, origin, mood, instrumentation, key, pitch,
vocal characteristics, tempo and sonority. Specifically, we explore how the
domain of pre-training datasets (music or generic audio) and the pre-training
methodology (supervised or unsupervised) affect the adequacy of the resulting
audio embeddings for downstream tasks.
We show that models trained via supervised learning on large-scale
expert-annotated music datasets achieve state-of-the-art performance in a wide
range of music labelling tasks, each with novel content and vocabularies. This
can be done in an efficient manner with models containing less than 100 million
parameters that require no fine-tuning or reparameterization for downstream
tasks, making this approach practical for industry-scale audio catalogs.
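As a concrete illustration of this shallow-transfer setup, the sketch below probes frozen embeddings with a lightweight per-task classifier. It is a minimal sketch, not the authors' code: the encoder is a stand-in (a fixed random projection), and the data, dimensions, and label set are invented for illustration.

```python
# Minimal sketch: shallow probing of frozen audio embeddings.
# The "encoder" is a fixed random projection standing in for any frozen
# pre-trained model; clips and labels are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
projection = rng.standard_normal((1024, 128))    # frozen "encoder" weights

def frozen_encoder(clips: np.ndarray) -> np.ndarray:
    # Stand-in for a pre-trained model: audio features -> fixed-size embedding.
    return clips @ projection

clips = rng.standard_normal((500, 1024))         # e.g., flattened log-mel patches
genres = rng.integers(0, 10, size=500)           # e.g., 10 genre labels

# Embeddings are computed once; no fine-tuning or reparameterization happens.
emb = frozen_encoder(clips)
X_tr, X_te, y_tr, y_te = train_test_split(emb, genres, random_state=0)

# Each downstream task (genre, mood, key, ...) trains only this shallow probe.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```

In this setting the expensive step (embedding an industry-scale catalog) is paid once, and each new labelling task costs only a small probe.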
Within the class of unsupervised learning strategies, we show that the domain
of the training dataset can significantly impact the performance of
representations learned by the model. We find that restricting the domain of
the pre-training dataset to music allows for training with smaller batch sizes
while achieving state-of-the-art in unsupervised learning -- and in some cases,
supervised learning -- for music understanding.
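One plausible reading of the batch-size result: contrastive objectives draw their negatives from within the batch, so batch size controls the strength of the learning signal. Below is a minimal InfoNCE sketch; the abstract does not name the exact unsupervised objective, so treat this as an assumed SimCLR-style setup, not the paper's method.

```python
# Minimal InfoNCE sketch (assumed SimCLR-style objective). Every non-matching
# item in the batch acts as a negative, which is why batch size matters for
# contrastive pre-training.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # z_a, z_b: (batch, dim) embeddings of two views/segments of the same clips.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))      # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

# With music-only pre-training data, even a modest batch may already supply
# informative in-domain negatives -- consistent with the finding above.
z1, z2 = torch.randn(64, 128), torch.randn(64, 128)
print(info_nce(z1, z2).item())
```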
We also corroborate that, while achieving state-of-the-art performance on
many tasks, supervised learning can cause models to specialize to the
supervised information provided, somewhat compromising a model's generality.
Related papers
- Foundation Models for Music: A Survey [77.77088584651268]
Foundation models (FMs) have profoundly impacted diverse sectors, including music.
This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music.
arXiv Detail & Related papers (2024-08-26T15:13:14Z) - An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging [6.363158395541767]
Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data.
In this study, we investigate and compare the performance of new self-supervised methods for music tagging.
arXiv Detail & Related papers (2024-04-14T07:56:08Z) - Self-Supervised Contrastive Learning for Robust Audio-Sheet Music
Retrieval Systems [3.997809845676912]
We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real music content.
We employ the snippet embeddings in the higher-level task of cross-modal piece identification.
In this work, we observe that the retrieval quality improves from 30% up to 100% when real music data is present.
arXiv Detail & Related papers (2023-09-21T14:54:48Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels for masked language modelling (MLM)-style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z) - Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music
Generation Task [86.72661027591394]
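A minimal sketch of the pseudo-label idea in the MERT entry above: a teacher assigns discrete targets to audio frames, a subset of frames is masked, and a student is trained to predict the teacher's labels at the masked positions. The nearest-centroid teacher and all shapes are assumptions for illustration, not MERT's actual teachers or architecture.

```python
# Minimal sketch of MLM-style acoustic pre-training with teacher pseudo-labels.
import torch
import torch.nn as nn

torch.manual_seed(0)
frames, dim, codebook = 100, 64, 32
features = torch.randn(frames, dim)           # acoustic frame features
centroids = torch.randn(codebook, dim)        # toy teacher codebook

# Teacher: assign each frame the id of its nearest centroid (pseudo label).
pseudo_labels = torch.cdist(features, centroids).argmin(dim=1)

# Mask a random subset of frames; the student sees zeros there.
mask = torch.rand(frames) < 0.3
student_in = features.clone()
student_in[mask] = 0.0

student = nn.Linear(dim, codebook)            # toy student predictor
logits = student(student_in)

# The loss is computed only at masked positions, as in MLM-style training.
loss = nn.functional.cross_entropy(logits[mask], pseudo_labels[mask])
print("masked pseudo-label loss:", loss.item())
```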
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z) - Music Instrument Classification Reprogrammed [79.68916470119743]
"Reprogramming" is a technique that utilizes pre-trained deep and complex neural networks originally targeting a different task by modifying and mapping both the input and output of the pre-trained model.
We demonstrate that reprogramming can effectively leverage the power of the representation learned for a different task and that the resulting reprogrammed system can perform on par or even outperform state-of-the-art systems at a fraction of training parameters.
arXiv Detail & Related papers (2022-11-15T18:26:01Z) - Representation Learning for the Automatic Indexing of Sound Effects
Libraries [79.68916470119743]
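A minimal sketch of the reprogramming recipe described above: a small trainable transform is applied to the input, the pre-trained network stays frozen, and its original output classes are mapped onto the new task's labels. The frozen "backbone" and the many-to-one label mapping below are toy stand-ins.

```python
# Minimal reprogramming sketch: trainable input transform + frozen backbone +
# fixed output-label mapping. The backbone is a toy stand-in, not a real model.
import torch
import torch.nn as nn

in_dim, backbone_classes, new_classes = 256, 1000, 10

backbone = nn.Linear(in_dim, backbone_classes)   # pretend pre-trained model
for p in backbone.parameters():
    p.requires_grad = False                      # backbone stays frozen

input_transform = nn.Linear(in_dim, in_dim)      # the only trainable part

# Many-to-one mapping from the backbone's source classes to the new labels.
label_map = torch.arange(backbone_classes) % new_classes

def reprogrammed(x: torch.Tensor) -> torch.Tensor:
    source_logits = backbone(input_transform(x))
    target_logits = torch.full((x.size(0), new_classes), float("-inf"))
    # Aggregate each source class's logit onto its target class (max pooling).
    return target_logits.scatter_reduce(1, label_map.expand(x.size(0), -1),
                                        source_logits, reduce="amax")

x = torch.randn(8, in_dim)
print(reprogrammed(x).shape)                     # torch.Size([8, 10])
```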
- Representation Learning for the Automatic Indexing of Sound Effects Libraries [79.68916470119743]
We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size.
Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness.
arXiv Detail & Related papers (2022-08-18T23:46:13Z) - Learning music audio representations via weak language supervision [14.335950077921435]
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z) - Multi-task Learning with Metadata for Music Mood Classification [0.0]
- Multi-task Learning with Metadata for Music Mood Classification [0.0]
Mood recognition is an important problem in music informatics and has key applications in music discovery and recommendation.
We propose a multi-task learning approach in which a shared model is simultaneously trained for mood and metadata prediction tasks.
Applying our technique to existing state-of-the-art convolutional neural networks for mood classification consistently improves their performance.
arXiv Detail & Related papers (2021-10-10T11:36:34Z) - Multi-Task Self-Supervised Pre-Training for Music Classification [36.21650132145048]
- Multi-Task Self-Supervised Pre-Training for Music Classification [36.21650132145048]
We apply self-supervised and multi-task learning methods for pre-training music encoders.
We investigate how these design choices interact with various downstream music classification tasks.
arXiv Detail & Related papers (2021-02-05T15:19:58Z)