Contrastive Learning of General-Purpose Audio Representations
- URL: http://arxiv.org/abs/2010.10915v1
- Date: Wed, 21 Oct 2020 11:56:22 GMT
- Title: Contrastive Learning of General-Purpose Audio Representations
- Authors: Aaqib Saeed, David Grangier, Neil Zeghidour
- Abstract summary: We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
We build on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement model of audio.
- Score: 33.15189569532155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce COLA, a self-supervised pre-training approach for learning a
general-purpose representation of audio. Our approach is based on contrastive
learning: it learns a representation which assigns high similarity to audio
segments extracted from the same recording while assigning lower similarity to
segments from different recordings. We build on top of recent advances in
contrastive learning for computer vision and reinforcement learning to design a
lightweight, easy-to-implement self-supervised model of audio. We pre-train
embeddings on the large-scale Audioset database and transfer these
representations to 9 diverse classification tasks, including speech, music,
animal sounds, and acoustic scenes. We show that despite its simplicity, our
method significantly outperforms previous self-supervised systems. We
furthermore conduct ablation studies to identify key design choices and release
a library to pre-train and fine-tune COLA models.
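To make the objective described in the abstract concrete, below is a minimal sketch of a COLA-style contrastive loss in PyTorch. It is illustrative only and not the authors' released library: the class name `ColaStyleLoss`, the embedding size, and the use of a learnable bilinear similarity over batch-internal negatives are assumptions made for this example, with the encoder omitted entirely.

```python
# Minimal sketch of a COLA-style contrastive objective (illustrative; the class
# name, dimensions, and similarity choice are assumptions, not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ColaStyleLoss(nn.Module):
    """Bilinear-similarity contrastive loss over segment pairs.

    Row i of `anchors` and `positives` holds embeddings of two segments cropped
    from the same recording i; every other positive in the batch acts as a
    negative for anchor i.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Learnable similarity matrix W, initialised to the identity.
        self.bilinear = nn.Parameter(torch.eye(dim))

    def forward(self, anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
        # anchors, positives: (batch, dim) segment embeddings from an audio encoder.
        logits = anchors @ self.bilinear @ positives.t()  # (batch, batch) similarities
        targets = torch.arange(anchors.size(0), device=anchors.device)
        # Diagonal entries are the positive pairs; off-diagonal entries are negatives.
        return F.cross_entropy(logits, targets)


# Toy usage: in practice the embeddings would come from an encoder applied to two
# random crops of each recording (e.g. log-mel spectrogram patches).
if __name__ == "__main__":
    batch, dim = 8, 512
    loss_fn = ColaStyleLoss(dim)
    a, p = torch.randn(batch, dim), torch.randn(batch, dim)
    print(loss_fn(a, p).item())
```

Treating the rest of the batch as negatives keeps the method lightweight: no memory bank or momentum encoder is needed, which matches the "easy-to-implement" framing in the abstract.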
Related papers
- AudioFormer: Audio Transformer learns audio feature representations from
discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z) - MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z) - Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z) - Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning with masked data modelling to learn joint audio-visual representations.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z) - Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural-language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z) - Deep Clustering For General-Purpose Audio Representations [2.8086459907382224]
We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations.
We pre-train DECAR embeddings on a balanced subset of the large-scale Audioset dataset.
We transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes.
arXiv Detail & Related papers (2021-10-17T19:03:51Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z) - BYOL for Audio: Self-Supervised Learning for General-Purpose Audio
Representation [40.116109908079935]
BYOL-A is an audio self-supervised learning method based on BYOL for learning general-purpose audio representations.
With a combination of normalization and augmentation techniques, BYOL-A achieves state-of-the-art results in various downstream tasks.
arXiv Detail & Related papers (2021-03-11T14:32:33Z) - A Framework for Generative and Contrastive Learning of Audio
Representations [2.8935588665357077]
We present a framework for contrastive learning of audio representations in a self-supervised setting, without access to ground-truth labels.
We also explore generative models based on state-of-the-art transformer-based architectures for learning latent spaces for audio signals.
Our system achieves considerable performance compared to a fully supervised method that has access to ground-truth labels for training the neural network model.
arXiv Detail & Related papers (2020-10-22T05:52:32Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual
Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z) - Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
We show that our localization model significantly outperforms existing methods, and building on it we achieve comparable performance in sound separation without relying on external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.