A Framework for Generative and Contrastive Learning of Audio
Representations
- URL: http://arxiv.org/abs/2010.11459v2
- Date: Tue, 16 Mar 2021 21:41:13 GMT
- Title: A Framework for Generative and Contrastive Learning of Audio
Representations
- Authors: Prateek Verma, Julius Smith
- Abstract summary: We present a framework for contrastive learning of audio representations in a self-supervised setting, without access to ground-truth labels.
We also explore generative models based on state-of-the-art transformer architectures for learning latent spaces of audio signals.
Our system achieves considerable performance compared to a fully supervised method that has access to ground-truth labels for training the neural network model.
- Score: 2.8935588665357077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a framework for contrastive learning of
audio representations in a self-supervised setting, without access to any
ground-truth labels. The core idea of self-supervised contrastive learning is
to map an audio signal and its various augmented versions (representative of
salient aspects of audio, such as pitch and timbre) to a space where they are
close together and are separated from other, different signals. In addition, we
explore generative models based on state-of-the-art transformer architectures
for learning latent spaces of audio signals, again without access to any
labels. Here, we map audio signals at a smaller scale to discrete dictionary
elements and train transformers to predict the next dictionary element. The
data itself serves as the only supervision, bypassing the need for labels to
train the deep neural networks. We then use a linear classifier head to
evaluate the performance of both the self-supervised contrastive and the
generative transformer-based representations that are learned. Our system
achieves considerable performance compared to a fully supervised method with
access to ground-truth labels for training the neural network model. Given the
availability of large-scale audio data, these representations show promise for
a variety of audio understanding tasks.
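To make the contrastive objective concrete, here is a minimal sketch of an NT-Xent (InfoNCE-style) loss over paired embeddings of a clip and its augmentation, in the spirit of the mapping described above. The function name `nt_xent_loss`, the batch size, embedding dimension, and temperature are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of two augmented views of the same clips."""
    batch = z_a.shape[0]
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)  # (2B, D), unit-norm
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    # For row i, the positive is the other augmented view of the same clip;
    # every other clip in the batch acts as a negative.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

# Example: embeddings of 8 clips and their pitch/timbre-preserving augmentations.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = nt_xent_loss(z1, z2)
```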
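The generative pathway can likewise be sketched as next-element prediction over discrete dictionary codes. The class `NextTokenAudioLM`, the nearest-neighbour quantizer, and the codebook size and model width below are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class NextTokenAudioLM(nn.Module):
    def __init__(self, codebook_size: int = 512, dim: int = 256, max_len: int = 1024):
        super().__init__()
        self.tok = nn.Embedding(codebook_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) integer dictionary indices for audio frames.
        seq = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        # Causal mask so each position only attends to earlier dictionary elements.
        mask = torch.triu(torch.full((seq, seq), float("-inf"),
                                     device=tokens.device), diagonal=1)
        x = self.encoder(x, mask=mask)
        return self.head(x)  # logits over the next dictionary element

# Quantize frames to their nearest codebook entries, then train with shifted targets.
frames, codebook = torch.randn(2, 100, 64), torch.randn(512, 64)
tokens = torch.cdist(frames, codebook.unsqueeze(0)).argmin(-1)  # (2, 100) indices
logits = NextTokenAudioLM()(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 512),
                                   tokens[:, 1:].reshape(-1))
```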
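Finally, a sketch of the linear-probe evaluation mentioned above: the pretrained encoder is frozen and only a linear classifier head is trained on top of the learned representations. The `linear_probe` helper, the encoder interface, and the training-loop details are placeholders, not the paper's protocol.

```python
import torch
import torch.nn as nn

def linear_probe(encoder: nn.Module, loader, num_classes: int,
                 dim: int, epochs: int = 10) -> nn.Linear:
    encoder.eval()                       # freeze the learned representation
    for p in encoder.parameters():
        p.requires_grad_(False)
    head = nn.Linear(dim, num_classes)   # only these weights are trained
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        for audio, labels in loader:
            with torch.no_grad():
                feats = encoder(audio)   # fixed self-supervised features
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head

# Example with a toy frozen encoder and synthetic 1-second clips at 16 kHz.
enc = nn.Sequential(nn.Flatten(), nn.Linear(16000, 128))
data = [(torch.randn(4, 16000), torch.randint(0, 10, (4,))) for _ in range(3)]
head = linear_probe(enc, data, num_classes=10, dim=128, epochs=2)
```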
Related papers
- SLICER: Learning universal audio representations using low-resource
self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word-error rates similar to those of previous self-supervised learning work with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data [24.608764078208953]
Subgraphs are constructed by sampling the entire pool of available training data to exploit the relationship between labeled and unlabeled audio samples.
We evaluate our model on three benchmark audio databases, and two tasks: acoustic event detection and speech emotion recognition.
Our model is compact (240k parameters), and can produce generalized audio representations that are robust to different types of signal noise.
arXiv Detail & Related papers (2022-01-31T21:32:22Z)
- SSAST: Self-Supervised Audio Spectrogram Transformer [19.09439093130855]
We propose to pretrain the Audio Spectrogram Transformer (AST) model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio.
We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification.
To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
arXiv Detail & Related papers (2021-10-19T07:58:28Z)
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
This is done in a completely label-free manner by exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast to prior approaches, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- Contrastive Learning of General-Purpose Audio Representations [33.15189569532155]
We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
We build on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement model of audio.
arXiv Detail & Related papers (2020-10-21T11:56:22Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space [2.8935588665357077]
This paper builds upon key ideas to build perception of touch sounds without access to any ground-truth data.
We show how to leverage ideas from classical signal processing to obtain large amounts of data for any sound of interest with high precision.
arXiv Detail & Related papers (2020-02-10T20:33:25Z)