Spectrograms Are Sequences of Patches
- URL: http://arxiv.org/abs/2210.15988v1
- Date: Fri, 28 Oct 2022 08:39:36 GMT
- Title: Spectrograms Are Sequences of Patches
- Authors: Leyi Zhao, Yi Li
- Abstract summary: We design Patchifier, a self-supervised model that treats a spectrogram of music as a sequence of patches.
We use no labeled data for pre-training, only a subset of the MTAT dataset containing 16k music clips.
Our model achieves competitive results compared to other audio representation models.
- Score: 5.253100011321437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised pre-training models have been used successfully in several machine learning domains, but only a small amount of this work concerns music. We treat a spectrogram of music as a sequence of patches and design a self-supervised model, Patchifier, that captures the features of these sequential patches, drawing on self-supervised learning methods from both the NLP and CV domains. We use no labeled data for the pre-training process, only a subset of the MTAT dataset containing 16k music clips. After pre-training, we apply the model to several downstream tasks, where it achieves competitive results compared to other audio representation models. Our work also demonstrates that it makes sense to treat audio as a series of patch segments.
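As a rough illustration of the patch view the abstract describes (our sketch, not the authors' code), the standard way to turn a spectrogram into a sequence of patch embeddings is a strided convolution, as in ViT; the patch size and dimensions below are illustrative assumptions.

```python
# Minimal sketch: a spectrogram as a sequence of patches (ViT-style).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a (batch, 1, freq, time) spectrogram into non-overlapping
    patches and project each patch to an embedding vector."""
    def __init__(self, patch_size=16, embed_dim=256):
        super().__init__()
        # A conv whose stride equals its kernel size is the standard trick
        # for "flatten each patch, then apply one shared linear projection".
        self.proj = nn.Conv2d(1, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, spec):
        x = self.proj(spec)                  # (B, D, freq/P, time/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, D)

mel = torch.randn(4, 1, 128, 256)   # batch of 128-bin mel-spectrograms
tokens = PatchEmbed()(mel)          # (4, 8 * 16, 256) = (4, 128, 256)
```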
Related papers
- MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization [24.991558192161]
We propose a self-supervised music representation learning model for music understanding.
MuQ is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ)
Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models.
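For readers unfamiliar with the mechanism named above, the sketch below shows a generic residual vector quantization (RVQ) step, where each stage tokenizes the residual left by the previous one; the codebook shapes are illustrative and this is not MuQ's implementation.

```python
import torch

def residual_vq(x, codebooks):
    """Quantize x with a stack of codebooks: each stage snaps the current
    residual to its nearest codeword and emits that codeword's index.
    x: (N, D); codebooks: list of (K, D) tensors. Returns (N, stages)."""
    residual, tokens = x, []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (N, K) pairwise distances
        idx = dists.argmin(dim=1)           # nearest codeword per vector
        tokens.append(idx)
        residual = residual - cb[idx]       # pass the remainder onward
    return torch.stack(tokens, dim=1)

feats = torch.randn(100, 64)                      # e.g. mel-frame features
books = [torch.randn(256, 64) for _ in range(4)]  # 4 stages, 256 codes each
print(residual_vq(feats, books).shape)            # torch.Size([100, 4])
```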
arXiv Detail & Related papers (2025-01-02T07:08:29Z)
- Parameter-Efficient Transfer Learning for Music Foundation Models [51.61531917413708]
We investigate the use of parameter-efficient transfer learning (PETL) for music foundation models.
PETL methods outperform both probing and fine-tuning on music auto-tagging.
PETL methods achieve similar results as fine-tuning with significantly less training cost.
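PETL is an umbrella for adapters, LoRA, prompt tuning, and similar methods; the sketch below shows one common variant, a bottleneck adapter trained against a frozen backbone, as a generic illustration rather than the paper's specific method.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a tiny residual MLP that is trained while the
    surrounding foundation-model weights stay frozen. Sizes illustrative."""
    def __init__(self, dim=768, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))  # residual update

backbone_layer = nn.Linear(768, 768)        # stand-in for a frozen layer
for p in backbone_layer.parameters():
    p.requires_grad = False                 # backbone stays frozen
adapter = Adapter()                         # only these weights train
out = adapter(backbone_layer(torch.randn(4, 768)))
```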
arXiv Detail & Related papers (2024-11-28T20:50:40Z)
- An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging [6.363158395541767]
Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data.
In this study, we investigate and compare the performance of new self-supervised methods for music tagging.
arXiv Detail & Related papers (2024-04-14T07:56:08Z)
- Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
In this work, we explore large pre-trained multi-modal models to obtain features, i.e., CLIP for visual features and CLAP for audio features.
We propose a simple yet effective model that only relies on feed-forward neural networks.
Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL.
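As a hedged sketch of the general recipe (not a reproduction of the paper's exact architecture): frozen CLIP and CLAP features are passed through small feed-forward networks into a joint space and scored against class text embeddings, so an unseen class needs only a text embedding; the dimensions and sum-fusion choice below are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVClassifier(nn.Module):
    """Project pre-extracted CLIP (visual) and CLAP (audio) features into
    a joint space; score classes by cosine similarity to text embeddings."""
    def __init__(self, vis_dim=512, aud_dim=512, joint_dim=256):
        super().__init__()
        self.vis = nn.Sequential(nn.Linear(vis_dim, joint_dim), nn.ReLU(),
                                 nn.Linear(joint_dim, joint_dim))
        self.aud = nn.Sequential(nn.Linear(aud_dim, joint_dim), nn.ReLU(),
                                 nn.Linear(joint_dim, joint_dim))

    def forward(self, v, a, class_text):    # class_text: (C, joint_dim)
        z = F.normalize(self.vis(v) + self.aud(a), dim=-1)
        return z @ F.normalize(class_text, dim=-1).T   # (B, C) logits

model = AVClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(10, 256))
```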
arXiv Detail & Related papers (2024-04-09T13:39:37Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
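The core of MLM-style acoustic pre-training is predicting the teacher's pseudo-label token only at masked frame positions; a minimal sketch of that loss follows (shapes illustrative, not MERT's code).

```python
import torch
import torch.nn.functional as F

def masked_token_loss(student_logits, teacher_tokens, mask):
    """student_logits: (B, T, V) frame-wise token logits from the student;
    teacher_tokens: (B, T) pseudo-labels from a frozen teacher;
    mask: (B, T) bool, True where the input frames were masked."""
    logits = student_logits[mask]      # keep only masked positions
    targets = teacher_tokens[mask]
    return F.cross_entropy(logits, targets)

B, T, V = 2, 100, 300
loss = masked_token_loss(torch.randn(B, T, V),
                         torch.randint(0, V, (B, T)),
                         torch.rand(B, T) < 0.5)
```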
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- Supervised and Unsupervised Learning of Audio Representations for Music Understanding [9.239657838690226]
We show how the domain of pre-training datasets affects the adequacy of the resulting audio embeddings for downstream tasks.
We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-10-07T20:07:35Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning with masked data modeling to learn a joint audio-visual representation.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
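Conceptually, CAV-MAE sums a masked-reconstruction objective with an audio-visual contrastive objective; the sketch below shows that combination, with the loss weight and temperature as illustrative assumptions rather than the paper's exact values.

```python
import torch
import torch.nn.functional as F

def cav_mae_loss(audio_emb, video_emb, recon, target, temp=0.07, lam=0.01):
    """Symmetric InfoNCE between each clip's audio and visual embeddings,
    plus MSE reconstruction of the masked patches."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.T / temp                       # (B, B) similarities
    labels = torch.arange(len(a))                 # matching pairs on diagonal
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.T, labels)) / 2
    reconstruction = F.mse_loss(recon, target)    # masked-patch MSE
    return reconstruction + lam * contrastive

loss = cav_mae_loss(torch.randn(8, 256), torch.randn(8, 256),
                    torch.randn(8, 196, 768), torch.randn(8, 196, 768))
```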
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data [24.608764078208953]
Subgraphs are constructed by sampling from the entire pool of available training data, exploiting the relationship between labeled and unlabeled audio samples.
We evaluate our model on three benchmark audio databases and two tasks: acoustic event detection and speech emotion recognition.
Our model is compact (240k parameters), and can produce generalized audio representations that are robust to different types of signal noise.
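As a hedged illustration of the subgraph idea (not the paper's exact recipe): sample a batch that mixes labeled and unlabeled clips and connect each clip to its k most similar neighbours, so label information can propagate along edges during training.

```python
import torch

def knn_subgraph(feats, k=5):
    """Build the edge list of a k-NN cosine-similarity subgraph over a
    sampled batch of audio embeddings. feats: (n, d)."""
    sim = torch.nn.functional.cosine_similarity(
        feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)   # (n, n)
    sim.fill_diagonal_(float("-inf"))                     # no self-edges
    nbrs = sim.topk(k, dim=1).indices                     # (n, k)
    return [(i, int(j)) for i in range(len(feats)) for j in nbrs[i]]

edges = knn_subgraph(torch.randn(32, 128))   # 32-clip mixed batch
```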
arXiv Detail & Related papers (2022-01-31T21:32:22Z)
- Multi-Task Self-Training for Learning General Representations [97.01728635294879]
Multi-task self-training (MuST) harnesses the knowledge in independent specialized teacher models to train a single general student model.
MuST is scalable with unlabeled or partially labeled datasets and outperforms both specialized supervised models and self-supervised models when training on large scale datasets.
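A minimal sketch of the multi-task self-training loop, assuming hypothetical tasks and stand-in teacher networks; MuST's actual teachers, tasks, and losses differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Student(nn.Module):
    """One shared backbone plus one linear head per task (sizes illustrative)."""
    def __init__(self, tasks, dim=64):
        super().__init__()
        self.backbone = nn.Linear(128, dim)
        self.heads = nn.ModuleDict({t: nn.Linear(dim, c)
                                    for t, c in tasks.items()})

def must_step(student, teachers, x):
    feats = torch.relu(student.backbone(x))
    loss = 0.0
    for task, teacher in teachers.items():
        with torch.no_grad():
            pseudo = teacher(x).argmax(dim=1)   # teacher pseudo-labels
        loss = loss + F.cross_entropy(student.heads[task](feats), pseudo)
    return loss

tasks = {"tagging": 10, "scenes": 32}           # hypothetical task heads
student = Student(tasks)
teachers = {t: nn.Linear(128, c) for t, c in tasks.items()}  # stand-ins
loss = must_step(student, teachers, torch.randn(8, 128))
```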
arXiv Detail & Related papers (2021-08-25T17:20:50Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer does.
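ALBERT's defining parameter-saving trick is to share one transformer layer's weights across every depth; a minimal sketch of that idea with illustrative sizes (our illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Instantiate ONE transformer layer and apply it N times, so the
    network gets deeper without adding parameters."""
    def __init__(self, dim=256, heads=4, depth=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)    # same weights reused at every depth
        return x

enc = SharedLayerEncoder()
out = enc(torch.randn(2, 50, 256))   # (2, 50, 256)
```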
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.