Self-Supervised Speech Representation Learning: A Review
- URL: http://arxiv.org/abs/2205.10643v1
- Date: Sat, 21 May 2022 16:52:57 GMT
- Title: Self-Supervised Speech Representation Learning: A Review
- Authors: Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn,
Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu,
Lars Maaløe, Tara N. Sainath, Shinji Watanabe
- Abstract summary: Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
- Score: 105.1545308184483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although supervised deep learning has revolutionized speech and audio
processing, it has necessitated the building of specialist models for
individual tasks and application scenarios. It is likewise difficult to apply
this to dialects and languages for which only limited labeled data is
available. Self-supervised representation learning methods promise a single
universal model that would benefit a wide variety of tasks and domains. Such
methods have shown success in natural language processing and computer vision
domains, achieving new levels of performance while reducing the number of
labels required for many downstream scenarios. Speech representation learning
is experiencing similar progress in three main categories: generative,
contrastive, and predictive methods. Other approaches rely on multi-modal data
for pre-training, mixing text or visual data streams with speech. Although
self-supervised speech representation is still a nascent research area, it is
closely related to acoustic word embedding and learning with zero lexical
resources, both of which have seen active research for many years. This review
presents approaches for self-supervised speech representation learning and
their connection to other research areas. Since many current methods focus
solely on automatic speech recognition as a downstream task, we review recent
efforts on benchmarking learned representations to extend the application
beyond speech recognition.
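To make the "contrastive" category concrete, below is a minimal sketch of an InfoNCE-style objective of the kind popularized by wav2vec 2.0-like models: the context network's output at a masked position must identify the true latent target among sampled distractors. The function name, tensor shapes, and toy usage are illustrative assumptions, not code from the review.

```python
# Minimal sketch of a contrastive (InfoNCE-style) SSL objective.
# Shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(context, targets, negatives, temperature=0.1):
    """context:   (B, T, D) context-network outputs at masked positions
       targets:   (B, T, D) true latents at those positions (positives)
       negatives: (B, T, K, D) distractor latents sampled from the utterance
    """
    pos = F.cosine_similarity(context, targets, dim=-1)                 # (B, T)
    neg = F.cosine_similarity(context.unsqueeze(2), negatives, dim=-1)  # (B, T, K)
    logits = torch.cat([pos.unsqueeze(-1), neg], dim=-1) / temperature  # (B, T, 1+K)
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long)           # positive at index 0
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

# Toy usage with random tensors:
B, T, K, D = 2, 5, 10, 16
loss = info_nce_loss(torch.randn(B, T, D), torch.randn(B, T, D),
                     torch.randn(B, T, K, D))
```

By contrast, generative methods reconstruct the input (or masked parts of it), and predictive methods regress fixed targets such as cluster IDs, as in HuBERT-style training.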
Related papers
- Few-Shot Spoken Language Understanding via Joint Speech-Text Models [18.193191170754744]
Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations.
We leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks.
By employing a pre-trained speech-text model, we find that models fine-tuned on text can be effectively transferred to speech testing data.
arXiv Detail & Related papers (2023-10-09T17:59:21Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at tens of thousands of samples per second, contain considerable redundancy.
Recent work has proposed using discrete speech units derived from self-supervised learning representations.
Applying methods such as de-duplication and subword modeling can further compress the speech sequence length, as sketched after this entry.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
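A minimal sketch of the de-duplication step mentioned above: consecutive repeats of a discrete unit (e.g., a k-means cluster ID assigned to each frame of SSL features) are collapsed into a single token. The function name and example data are illustrative.

```python
# Minimal sketch of de-duplication: collapse runs of repeated discrete units.
from itertools import groupby

def deduplicate(units):
    """Collapse consecutive duplicates: [5,5,5,9,9,2,5] -> [5,9,2,5]."""
    return [u for u, _ in groupby(units)]

print(deduplicate([5, 5, 5, 9, 9, 2, 5]))  # [5, 9, 2, 5]
# A subword model (e.g., BPE over the unit vocabulary) could then be trained
# on the de-duplicated sequences to shorten them further.
```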
- Speech representation learning: Learning bidirectional encoders with single-view, multi-view, and multi-task methods [7.1345443932276424]
This thesis focuses on representation learning for sequence data over time or space.
It aims to improve downstream sequence prediction tasks by using the learned representations.
arXiv Detail & Related papers (2023-07-25T20:38:55Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained on more data outperform monolingual ones, but that, with the amount of data held fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract the target speech using visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- A Brief Overview of Unsupervised Neural Speech Representation Learning [12.850357461259197]
We review the development of unsupervised representation learning for speech over the last decade.
We identify two primary model categories: self-supervised methods and probabilistic latent variable models.
arXiv Detail & Related papers (2022-03-01T11:15:35Z)
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language [85.9019051663368]
data2vec is a framework that applies the same learning method to speech, NLP, and computer vision.
The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup.
Experiments on major benchmarks in speech recognition, image classification, and natural language understanding demonstrate new state-of-the-art or competitive performance; a minimal sketch of the self-distillation setup follows this entry.
arXiv Detail & Related papers (2022-02-07T22:52:11Z)
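A heavily simplified sketch of the data2vec self-distillation idea: a student encoder sees a masked view of the input and regresses latent targets produced by an exponential-moving-average (EMA) teacher on the full input. The actual method averages and normalizes several top teacher layers; the tiny Transformer, shapes, and masking scheme here are assumptions for illustration.

```python
# Sketch of masked latent prediction with an EMA teacher (data2vec-style).
# The encoder is a stand-in; shapes and hyperparameters are assumptions.
import copy
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
student = nn.TransformerEncoder(layer, num_layers=2)
teacher = copy.deepcopy(student)          # EMA copy; never updated by gradients
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(student, teacher, decay=0.999):
    with torch.no_grad():
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(decay).add_(ps, alpha=1 - decay)

x = torch.randn(2, 50, 32)                # (batch, frames, dim) toy features
mask = torch.rand(2, 50) < 0.5            # positions the student must predict
masked_x = x.masked_fill(mask.unsqueeze(-1), 0.0)

with torch.no_grad():
    targets = teacher(x)                  # latents of the *full* input
pred = student(masked_x)                  # student sees the masked view
loss = ((pred - targets) ** 2)[mask].mean()  # regress only at masked positions
loss.backward()
ema_update(student, teacher)
```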
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
Trained on a dataset orders of magnitude smaller than those required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features; a toy unit-level LM is sketched after this entry.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
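As a minimal illustration of the idea above, the sketch below is a toy next-unit LSTM language model over integer IDs standing in for phonemes or syllables. The class name, vocabulary size, and dimensions are assumptions, not the authors' model.

```python
# Toy LSTM language model over sub-word linguistic units (phoneme/syllable IDs).
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    def __init__(self, vocab_size=64, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, units):             # units: (B, T) integer unit IDs
        h, _ = self.lstm(self.embed(units))
        return self.head(h)               # (B, T, vocab) next-unit logits

model = UnitLM()
units = torch.randint(0, 64, (4, 20))     # toy batch of unit ID sequences
logits = model(units[:, :-1])
loss = nn.functional.cross_entropy(       # next-unit prediction objective
    logits.reshape(-1, 64), units[:, 1:].reshape(-1))
```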
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, a new state of the art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features [41.951988293049205]
We propose a two-stage approach comprising unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences; a toy sketch of the mining step follows this entry.
The proposed system effectively extracts topic-related words and phrases from lecture recordings on MIT OpenCourseWare.
arXiv Detail & Related papers (2020-11-03T20:06:48Z)
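To illustrate the pattern-mining stage of the entry above, here is a toy sketch that counts recurring n-grams of acoustic unit IDs across decoded utterances. A real system would use far more robust matching; the function, parameters, and data are purely illustrative.

```python
# Toy pattern mining: find n-grams of acoustic unit IDs that recur
# across decoded utterances.
from collections import Counter

def mine_patterns(utterances, n=3, min_count=2):
    counts = Counter()
    for seq in utterances:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]

utts = [[1, 4, 7, 7, 2], [9, 1, 4, 7, 3], [1, 4, 7, 5]]
print(mine_patterns(utts))  # [((1, 4, 7), 3)] -- recurs in all three utterances
```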