Spectrograms Are Sequences of Patches
- URL: http://arxiv.org/abs/2210.15988v1
- Date: Fri, 28 Oct 2022 08:39:36 GMT
- Title: Spectrograms Are Sequences of Patches
- Authors: Leyi Zhao, Yi Li
- Abstract summary: We design Patchifier, a self-supervised model that treats a spectrogram of music as a sequence of patches.
We use no labeled data for pre-training, only a subset of the MTAT dataset containing 16k music clips.
Our model achieves competitive results compared to other audio representation models.
- Score: 5.253100011321437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised pre-training models have been used successfully in several machine learning domains, but only a small amount of this work concerns music. We treat a spectrogram of music as a sequence of patches and design a self-supervised model, Patchifier, that captures the features of these sequential patches, drawing on self-supervised learning methods from both the NLP and CV domains. We use no labeled data for the pre-training process, only a subset of the MTAT dataset containing 16k music clips. After pre-training, we apply the model to several downstream tasks, where it achieves competitive results compared to other audio representation models. Our work also demonstrates that it makes sense to treat audio as a series of patch segments.
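As a rough illustration of the patch view the abstract describes (our sketch, not the authors' code), the standard way to turn a spectrogram into a sequence of patch embeddings is a strided convolution, as in ViT; the patch size and dimensions below are illustrative assumptions.

```python
# Minimal sketch: a spectrogram as a sequence of patches (ViT-style).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a (batch, 1, freq, time) spectrogram into non-overlapping
    patches and project each patch to an embedding vector."""
    def __init__(self, patch_size=16, embed_dim=256):
        super().__init__()
        # A conv whose stride equals its kernel size is the standard trick
        # for "flatten each patch, then apply one shared linear projection".
        self.proj = nn.Conv2d(1, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, spec):
        x = self.proj(spec)                  # (B, D, freq/P, time/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, D)

mel = torch.randn(4, 1, 128, 256)   # batch of 128-bin mel-spectrograms
tokens = PatchEmbed()(mel)          # (4, 8 * 16, 256) = (4, 128, 256)
```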
Related papers
- MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization [24.991558192161]
We propose a self-supervised music representation learning model for music understanding.
MuQ is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ)
Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models.
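For readers unfamiliar with the mechanism named above, the sketch below shows a generic residual vector quantization (RVQ) step, where each stage tokenizes the residual left by the previous one; the codebook shapes are illustrative and this is not MuQ's implementation.

```python
import torch

def residual_vq(x, codebooks):
    """Quantize x with a stack of codebooks: each stage snaps the current
    residual to its nearest codeword and emits that codeword's index.
    x: (N, D); codebooks: list of (K, D) tensors. Returns (N, stages)."""
    residual, tokens = x, []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (N, K) pairwise distances
        idx = dists.argmin(dim=1)           # nearest codeword per vector
        tokens.append(idx)
        residual = residual - cb[idx]       # pass the remainder onward
    return torch.stack(tokens, dim=1)

feats = torch.randn(100, 64)                      # e.g. mel-frame features
books = [torch.randn(256, 64) for _ in range(4)]  # 4 stages, 256 codes each
print(residual_vq(feats, books).shape)            # torch.Size([100, 4])
```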
arXiv Detail & Related papers (2025-01-02T07:08:29Z)
- Parameter-Efficient Transfer Learning for Music Foundation Models [51.61531917413708]
We investigate the use of parameter-efficient transfer learning (PETL) for music foundation models.
PETL methods outperform both probing and fine-tuning on music auto-tagging.
PETL methods achieve similar results as fine-tuning with significantly less training cost.
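PETL is an umbrella for adapters, LoRA, prompt tuning, and similar methods; the sketch below shows one common variant, a bottleneck adapter trained against a frozen backbone, as a generic illustration rather than the paper's specific method.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a tiny residual MLP that is trained while the
    surrounding foundation-model weights stay frozen. Sizes illustrative."""
    def __init__(self, dim=768, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))  # residual update

backbone_layer = nn.Linear(768, 768)        # stand-in for a frozen layer
for p in backbone_layer.parameters():
    p.requires_grad = False                 # backbone stays frozen
adapter = Adapter()                         # only these weights train
out = adapter(backbone_layer(torch.randn(4, 768)))
```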
arXiv Detail & Related papers (2024-11-28T20:50:40Z)
- An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging [6.363158395541767]
Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data.
In this study, we investigate and compare the performance of new self-supervised methods for music tagging.
arXiv Detail & Related papers (2024-04-14T07:56:08Z)
- Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
In this work, we explore large pre-trained multi-modal models to obtain features, i.e., CLIP for visual features and CLAP for audio features.
We propose a simple yet effective model that only relies on feed-forward neural networks.
Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL.
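As a hedged sketch of the general recipe (not a reproduction of the paper's exact architecture): frozen CLIP and CLAP features are passed through small feed-forward networks into a joint space and scored against class text embeddings, so an unseen class needs only a text embedding; the dimensions and sum-fusion choice below are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVClassifier(nn.Module):
    """Project pre-extracted CLIP (visual) and CLAP (audio) features into
    a joint space; score classes by cosine similarity to text embeddings."""
    def __init__(self, vis_dim=512, aud_dim=512, joint_dim=256):
        super().__init__()
        self.vis = nn.Sequential(nn.Linear(vis_dim, joint_dim), nn.ReLU(),
                                 nn.Linear(joint_dim, joint_dim))
        self.aud = nn.Sequential(nn.Linear(aud_dim, joint_dim), nn.ReLU(),
                                 nn.Linear(joint_dim, joint_dim))

    def forward(self, v, a, class_text):    # class_text: (C, joint_dim)
        z = F.normalize(self.vis(v) + self.aud(a), dim=-1)
        return z @ F.normalize(class_text, dim=-1).T   # (B, C) logits

model = AVClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(10, 256))
```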
arXiv Detail & Related papers (2024-04-09T13:39:37Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
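The core of MLM-style acoustic pre-training is predicting the teacher's pseudo-label token only at masked frame positions; a minimal sketch of that loss follows (shapes illustrative, not MERT's code).

```python
import torch
import torch.nn.functional as F

def masked_token_loss(student_logits, teacher_tokens, mask):
    """student_logits: (B, T, V) frame-wise token logits from the student;
    teacher_tokens: (B, T) pseudo-labels from a frozen teacher;
    mask: (B, T) bool, True where the input frames were masked."""
    logits = student_logits[mask]      # keep only masked positions
    targets = teacher_tokens[mask]
    return F.cross_entropy(logits, targets)

B, T, V = 2, 100, 300
loss = masked_token_loss(torch.randn(B, T, V),
                         torch.randint(0, V, (B, T)),
                         torch.rand(B, T) < 0.5)
```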
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- Supervised and Unsupervised Learning of Audio Representations for Music Understanding [9.239657838690226]
We show how the domain of pre-training datasets affects the adequacy of the resulting audio embeddings for downstream tasks.
We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-10-07T20:07:35Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning with masked data modeling to learn a joint audio-visual representation.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
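Conceptually, CAV-MAE sums a masked-reconstruction objective with an audio-visual contrastive objective; the sketch below shows that combination, with the loss weight and temperature as illustrative assumptions rather than the paper's exact values.

```python
import torch
import torch.nn.functional as F

def cav_mae_loss(audio_emb, video_emb, recon, target, temp=0.07, lam=0.01):
    """Symmetric InfoNCE between each clip's audio and visual embeddings,
    plus MSE reconstruction of the masked patches."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.T / temp                       # (B, B) similarities
    labels = torch.arange(len(a))                 # matching pairs on diagonal
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.T, labels)) / 2
    reconstruction = F.mse_loss(recon, target)    # masked-patch MSE
    return reconstruction + lam * contrastive

loss = cav_mae_loss(torch.randn(8, 256), torch.randn(8, 256),
                    torch.randn(8, 196, 768), torch.randn(8, 196, 768))
```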
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data [24.608764078208953]
Subgraphs are constructed by sampling from the entire pool of available training data, exploiting the relationship between labeled and unlabeled audio samples.
We evaluate our model on three benchmark audio databases and two tasks: acoustic event detection and speech emotion recognition.
Our model is compact (240k parameters), and can produce generalized audio representations that are robust to different types of signal noise.
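As a hedged illustration of the subgraph idea (not the paper's exact recipe): sample a batch that mixes labeled and unlabeled clips and connect each clip to its k most similar neighbours, so label information can propagate along edges during training.

```python
import torch

def knn_subgraph(feats, k=5):
    """Build the edge list of a k-NN cosine-similarity subgraph over a
    sampled batch of audio embeddings. feats: (n, d)."""
    sim = torch.nn.functional.cosine_similarity(
        feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)   # (n, n)
    sim.fill_diagonal_(float("-inf"))                     # no self-edges
    nbrs = sim.topk(k, dim=1).indices                     # (n, k)
    return [(i, int(j)) for i in range(len(feats)) for j in nbrs[i]]

edges = knn_subgraph(torch.randn(32, 128))   # 32-clip mixed batch
```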
arXiv Detail & Related papers (2022-01-31T21:32:22Z)
- Multi-Task Self-Training for Learning General Representations [97.01728635294879]
Multi-task self-training (MuST) harnesses the knowledge in independent specialized teacher models to train a single general student model.
MuST is scalable with unlabeled or partially labeled datasets and outperforms both specialized supervised models and self-supervised models when training on large scale datasets.
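A minimal sketch of the multi-task self-training loop, assuming hypothetical tasks and stand-in teacher networks; MuST's actual teachers, tasks, and losses differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Student(nn.Module):
    """One shared backbone plus one linear head per task (sizes illustrative)."""
    def __init__(self, tasks, dim=64):
        super().__init__()
        self.backbone = nn.Linear(128, dim)
        self.heads = nn.ModuleDict({t: nn.Linear(dim, c)
                                    for t, c in tasks.items()})

def must_step(student, teachers, x):
    feats = torch.relu(student.backbone(x))
    loss = 0.0
    for task, teacher in teachers.items():
        with torch.no_grad():
            pseudo = teacher(x).argmax(dim=1)   # teacher pseudo-labels
        loss = loss + F.cross_entropy(student.heads[task](feats), pseudo)
    return loss

tasks = {"tagging": 10, "scenes": 32}           # hypothetical task heads
student = Student(tasks)
teachers = {t: nn.Linear(128, c) for t, c in tasks.items()}  # stand-ins
loss = must_step(student, teachers, torch.randn(8, 128))
```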
arXiv Detail & Related papers (2021-08-25T17:20:50Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer does.
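ALBERT's defining parameter-saving trick is to share one transformer layer's weights across every depth; a minimal sketch of that idea with illustrative sizes (our illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Instantiate ONE transformer layer and apply it N times, so the
    network gets deeper without adding parameters."""
    def __init__(self, dim=256, heads=4, depth=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)    # same weights reused at every depth
        return x

enc = SharedLayerEncoder()
out = enc(torch.randn(2, 50, 256))   # (2, 50, 256)
```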
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.