SLICER: Learning universal audio representations using low-resource
self-supervised pre-training
- URL: http://arxiv.org/abs/2211.01519v2
- Date: Thu, 18 May 2023 01:31:48 GMT
- Title: SLICER: Learning universal audio representations using low-resource
self-supervised pre-training
- Authors: Ashish Seth and Sreyan Ghosh and S. Umesh and Dinesh Manocha
- Abstract summary: We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
- Score: 53.06337011259031
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new Self-Supervised Learning (SSL) approach to pre-train
encoders on unlabeled audio data that reduces the need for large amounts of
labeled data for audio and speech classification. Our primary aim is to learn
audio representations that can generalize across a large variety of speech and
non-speech tasks in a low-resource unlabeled audio pre-training setting.
Inspired by the recent success of clustering and contrastive learning paradigms
for SSL-based speech representation learning, we propose SLICER (Symmetrical
Learning of Instance and Cluster-level Efficient Representations), which brings
together the best of both paradigms. We use
a symmetric loss between latent representations from student and teacher
encoders and simultaneously solve instance and cluster-level contrastive
learning tasks. We obtain cluster representations online by just projecting the
input spectrogram into an output subspace with dimensions equal to the number
of clusters. In addition, we propose a novel mel-spectrogram augmentation
procedure, k-mix, based on mixup, which does not require labels and aids
unsupervised representation learning for audio. Overall, SLICER achieves
state-of-the-art results on the LAPE Benchmark \cite{9868132}, significantly
outperforming DeLoRes-M and other prior approaches, which are pre-trained on
$10\times$ more unsupervised data. We will make all our code available on
GitHub.
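To make the training objective concrete, the following is a minimal PyTorch sketch of the ideas described above: a symmetric instance-level and cluster-level contrastive loss between student and teacher encoders, with cluster representations obtained online by projecting each input into a subspace whose dimension equals the number of clusters, and a label-free k-mix augmentation. All names (`k_mix`, `slicer_loss`), the NT-Xent-style loss form, and the exact mixing recipe are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a SLICER-style objective, assuming PyTorch and hypothetical
# student/teacher encoders that return (embedding, cluster_logits) pairs.
import torch
import torch.nn.functional as F


def k_mix(spec: torch.Tensor, k: int = 2, alpha: float = 0.5) -> torch.Tensor:
    """Label-free mixup-style augmentation: mix each mel-spectrogram in the
    batch with k randomly permuted batch mates (one reading of "k-mix";
    the paper's exact recipe may differ)."""
    mixed = spec.clone()
    for _ in range(k):
        lam = torch.distributions.Beta(alpha, alpha).sample()
        perm = torch.randperm(spec.size(0))
        mixed = lam * mixed + (1 - lam) * spec[perm]
    return mixed


def contrastive(a: torch.Tensor, b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent-style loss: matching rows of a and b are positives,
    all other rows are negatives."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / tau                       # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


def slicer_loss(student, teacher, spec: torch.Tensor) -> torch.Tensor:
    """Symmetric instance- and cluster-level loss between student and teacher.
    Each encoder projects the input into a D-dim embedding and a K-dim cluster
    space (K = number of clusters); columns of the (B, K) soft assignments act
    as online cluster representations."""
    view_a, view_b = k_mix(spec), k_mix(spec)      # two augmented views
    z_sa, p_sa = student(view_a)
    z_sb, p_sb = student(view_b)
    z_ta, p_ta = teacher(view_a)
    z_tb, p_tb = teacher(view_b)
    # Instance-level: rows (clips) are contrasted across student/teacher views.
    inst = contrastive(z_sa, z_tb) + contrastive(z_sb, z_ta)
    # Cluster-level: transpose the softmaxed assignments so columns (clusters)
    # are contrasted instead of rows.
    clus = contrastive(F.softmax(p_sa, dim=1).t(), F.softmax(p_tb, dim=1).t()) \
         + contrastive(F.softmax(p_sb, dim=1).t(), F.softmax(p_ta, dim=1).t())
    return inst + clus
```

In schemes of this kind the teacher is often an exponential-moving-average copy of the student that receives no gradients; the abstract does not spell out that detail, so the sketch leaves both encoders abstract.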
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework, which categorizes the representations into a small number of phoneme-like units, is used to train the model to learn semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Deep Clustering For General-Purpose Audio Representations [2.8086459907382224]
We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations.
We pre-train DECAR embeddings on a balanced subset of the large-scale Audioset dataset.
We transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes.
arXiv Detail & Related papers (2021-10-17T19:03:51Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced to enhance unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvements.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)