Deep Clustering For General-Purpose Audio Representations
- URL: http://arxiv.org/abs/2110.08895v1
- Date: Sun, 17 Oct 2021 19:03:51 GMT
- Title: Deep Clustering For General-Purpose Audio Representations
- Authors: Sreyan Ghosh and Sandesh V Katta and Ashish Seth and S. Umesh
- Abstract summary: We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations.
We pre-train DECAR embeddings on a balanced subset of the large-scale Audioset dataset.
We transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes.
- Score: 2.8086459907382224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DECAR, a self-supervised pre-training approach for learning
general-purpose audio representations. Our system is based on clustering: it
utilizes an offline clustering step to provide target labels that act as
pseudo-labels for solving a prediction task. We develop on top of recent
advances in self-supervised learning for computer vision and design a
lightweight, easy-to-use self-supervised pre-training scheme. We pre-train
DECAR embeddings on a balanced subset of the large-scale Audioset dataset and
transfer those representations to 9 downstream classification tasks, including
speech, music, animal sounds, and acoustic scenes. Furthermore, we conduct
ablation studies identifying key design choices and also make all our code and
pre-trained models publicly available.
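As a rough sketch of this cluster-then-predict recipe (illustrative only, not the authors' released code), the snippet below runs scikit-learn k-means over stand-in encoder embeddings to produce pseudo-labels; the random features, array shapes, and cluster count are all assumptions for demonstration.

    # Hypothetical sketch of offline-clustering pseudo-labels, DeepCluster-style.
    # NOT the DECAR implementation: the encoder is replaced by random features
    # and k=16 clusters is an arbitrary choice.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 128))  # stand-in for encoder outputs on audio clips

    # Offline clustering step: cluster assignments become the pseudo-labels.
    kmeans = KMeans(n_clusters=16, n_init=10, random_state=0)
    pseudo_labels = kmeans.fit_predict(embeddings)

    # A classification head on top of the encoder would then be trained with
    # cross-entropy against these pseudo-labels, alternating clustering and
    # prediction over the course of pre-training.
    print(np.bincount(pseudo_labels))  # pseudo-class sizes

Once the pseudo-labels are fixed, the clusters play the role of latent classes, so the prediction task reduces to standard supervised classification.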
Related papers
- Class-Incremental Grouping Network for Continual Audio-Visual Learning [42.284785756540806]
We propose a class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning.
We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks.
Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance.
arXiv Detail & Related papers (2023-09-11T07:36:16Z) - UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation [53.06337011259031]
We introduce UnFuSeD, a novel approach to leverage self-supervised learning for audio classification.
We use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step.
UnFuSeD achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all our baselines.
arXiv Detail & Related papers (2023-03-10T02:43:36Z) - SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z) - SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning and masked data modeling to learn a joint audio-visual representation.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z) - Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose Prototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on CIFAR-10LT, CIFAR-100LT and WebVision datasets, observing that Prototypical obtains substantial improvements compared with state-of-the-art methods.
arXiv Detail & Related papers (2021-10-22T01:55:01Z) - Unsupervised Discriminative Learning of Sounds for Audio Event Classification [43.81789898864507]
Network-based audio event classification has shown the benefit of pre-training models on visual data such as ImageNet.
We show a fast and effective alternative that pre-trains the model unsupervised, using only audio data, and yet delivers performance on par with ImageNet pre-training.
arXiv Detail & Related papers (2021-05-19T17:42:03Z) - Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z) - Contrastive Learning of General-Purpose Audio Representations [33.15189569532155]
We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
We build on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio (a sketch of this contrastive setup appears after this list).
arXiv Detail & Related papers (2020-10-21T11:56:22Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
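Several of the entries above (COLA, CAV-MAE, the prototypical momentum contrastive work) rely on a batch-wise contrastive objective. The snippet below is a minimal InfoNCE sketch in PyTorch, assuming segments from the same clip form positive pairs; cosine similarity is used here for simplicity (COLA itself uses a bilinear similarity), and all shapes are illustrative.

    # Hypothetical InfoNCE sketch, not taken from any of the papers above.
    import torch
    import torch.nn.functional as F

    def info_nce(anchors, positives, temperature=0.1):
        # anchors[i] and positives[i] come from the same clip; every other
        # clip in the batch serves as a negative.
        anchors = F.normalize(anchors, dim=-1)
        positives = F.normalize(positives, dim=-1)
        logits = anchors @ positives.t() / temperature  # (B, B) pairwise similarities
        targets = torch.arange(anchors.size(0))         # diagonal entries are the positives
        return F.cross_entropy(logits, targets)

    # Toy usage: random 128-d segment embeddings for a batch of 8 clips.
    loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
    print(float(loss))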
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.