Deep Clustering For General-Purpose Audio Representations
- URL: http://arxiv.org/abs/2110.08895v1
- Date: Sun, 17 Oct 2021 19:03:51 GMT
- Title: Deep Clustering For General-Purpose Audio Representations
- Authors: Sreyan Ghosh and Sandesh V Katta and Ashish Seth and S. Umesh
- Abstract summary: We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations.
We pre-train DECAR embeddings on a balanced subset of the large-scale Audioset dataset.
We transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes.
- Score: 2.8086459907382224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DECAR, a self-supervised pre-training approach for learning
general-purpose audio representations. Our system is based on clustering: it
utilizes an offline clustering step to provide target labels that act as
pseudo-labels for solving a prediction task. We develop on top of recent
advances in self-supervised learning for computer vision and design a
lightweight, easy-to-use self-supervised pre-training scheme. We pre-train
DECAR embeddings on a balanced subset of the large-scale Audioset dataset and
transfer those representations to 9 downstream classification tasks, including
speech, music, animal sounds, and acoustic scenes. Furthermore, we conduct
ablation studies identifying key design choices and also make all our code and
pre-trained models publicly available.
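As a rough sketch of this cluster-then-predict recipe (illustrative only, not the authors' released code), the snippet below runs scikit-learn k-means over stand-in encoder embeddings to produce pseudo-labels; the random features, array shapes, and cluster count are all assumptions for demonstration.

    # Hypothetical sketch of offline-clustering pseudo-labels, DeepCluster-style.
    # NOT the DECAR implementation: the encoder is replaced by random features
    # and k=16 clusters is an arbitrary choice.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 128))  # stand-in for encoder outputs on audio clips

    # Offline clustering step: cluster assignments become the pseudo-labels.
    kmeans = KMeans(n_clusters=16, n_init=10, random_state=0)
    pseudo_labels = kmeans.fit_predict(embeddings)

    # A classification head on top of the encoder would then be trained with
    # cross-entropy against these pseudo-labels, alternating clustering and
    # prediction over the course of pre-training.
    print(np.bincount(pseudo_labels))  # pseudo-class sizes

Once the pseudo-labels are fixed, the clusters play the role of latent classes, so the prediction task reduces to standard supervised classification.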
Related papers
- Class-Incremental Grouping Network for Continual Audio-Visual Learning [42.284785756540806]
We propose a class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning.
We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks.
Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance.
arXiv Detail & Related papers (2023-09-11T07:36:16Z) - UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation [53.06337011259031]
We introduce UnFuSeD, a novel approach to leverage self-supervised learning for audio classification.
We use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step.
UnFuSeD achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all our baselines.
arXiv Detail & Related papers (2023-03-10T02:43:36Z) - SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z) - SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning and masked data modeling to learn a joint audio-visual representation.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z) - Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose Prototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on CIFAR-10LT, CIFAR-100LT and WebVision datasets, observing that Prototypical obtains substantial improvements compared with state-of-the-art methods.
arXiv Detail & Related papers (2021-10-22T01:55:01Z) - Unsupervised Discriminative Learning of Sounds for Audio Event Classification [43.81789898864507]
Network-based audio event classification has shown the benefit of pre-training models on visual data such as ImageNet.
We show a fast and effective alternative that pre-trains the model unsupervised, using only audio data, and yet delivers performance on par with ImageNet pre-training.
arXiv Detail & Related papers (2021-05-19T17:42:03Z) - Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z) - Contrastive Learning of General-Purpose Audio Representations [33.15189569532155]
We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
We build on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio (a sketch of this contrastive setup appears after this list).
arXiv Detail & Related papers (2020-10-21T11:56:22Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
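Several of the entries above (COLA, CAV-MAE, the prototypical momentum contrastive work) rely on a batch-wise contrastive objective. The snippet below is a minimal InfoNCE sketch in PyTorch, assuming segments from the same clip form positive pairs; cosine similarity is used here for simplicity (COLA itself uses a bilinear similarity), and all shapes are illustrative.

    # Hypothetical InfoNCE sketch, not taken from any of the papers above.
    import torch
    import torch.nn.functional as F

    def info_nce(anchors, positives, temperature=0.1):
        # anchors[i] and positives[i] come from the same clip; every other
        # clip in the batch serves as a negative.
        anchors = F.normalize(anchors, dim=-1)
        positives = F.normalize(positives, dim=-1)
        logits = anchors @ positives.t() / temperature  # (B, B) pairwise similarities
        targets = torch.arange(anchors.size(0))         # diagonal entries are the positives
        return F.cross_entropy(logits, targets)

    # Toy usage: random 128-d segment embeddings for a batch of 8 clips.
    loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
    print(float(loss))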
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.