An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2110.04590v1
- Date: Sat, 9 Oct 2021 15:06:09 GMT
- Title: An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition
- Authors: Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu,
Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-yi Lee,
Shinji Watanabe
- Abstract summary: We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR.
- Score: 98.70304981174748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pretraining on speech data has made substantial progress.
High-fidelity representations of the speech signal are learned from large amounts of
untranscribed data and show promising performance. Recently, several works have
focused on evaluating the quality of self-supervised pretrained representations on
various tasks without domain restriction, e.g., SUPERB. However, such evaluations do
not provide a comprehensive comparison across many ASR benchmark corpora. In this
paper, we focus on the general application of pretrained speech representations to
advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several
pretrained speech representations and present experimental results on various
open-source and publicly available corpora for E2E-ASR. Without any modification of
the back-end model architectures or training strategy, some experiments with
pretrained representations, e.g., WSJ and WSJ0-2mix with HuBERT, match or surpass the
current state-of-the-art (SOTA) recognition performance. Moreover, we further explore
whether the pretrained representations remain effective in additional scenarios, such
as cross-language or overlapped speech. The scripts, configurations, and trained
models have been released in ESPnet so that the community can reproduce and improve
upon our experiments.
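As a rough illustration of this setup, the sketch below swaps frozen HuBERT features in for the usual filterbank front-end while leaving the back-end untouched. It is a minimal sketch only: the Hugging Face model id, the placeholder audio file, and the toy Transformer back-end are assumptions made for illustration; the paper's actual recipes and trained models are the ones released in ESPnet.

```python
# Minimal sketch: frozen HuBERT features as the ASR front-end in place of
# log-mel filterbanks. Model id, audio path, and the toy back-end are
# illustrative assumptions, not the paper's exact ESPnet recipe.
import torch
import torchaudio
from transformers import HubertModel, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

waveform, sr = torchaudio.load("sample.wav")            # placeholder: a 16 kHz mono clip
inputs = extractor(waveform.squeeze(0).numpy(), sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    feats = hubert(**inputs).last_hidden_state           # (1, T', 768) frame-level features

# The pretrained features replace the filterbank front-end; the E2E-ASR back-end
# (here a tiny Transformer encoder standing in for a Conformer/CTC-attention model)
# is left unchanged, mirroring the "no back-end modification" setting in the paper.
asr_encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
encoded = asr_encoder(feats)
print(encoded.shape)
```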
Related papers
- Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations [1.6008229267455227]
We propose a multi-view SSL pre-training technique that can be applied to various representations of speech, including the ones generated by large speech models.
Our experiments, based on wav2vec 2.0, spectral, and paralinguistic features, demonstrate that the proposed framework boosts SER performance by up to 10% in Unweighted Average Recall.
arXiv Detail & Related papers (2024-06-12T06:06:55Z)
- A Comparative Study of Pre-trained Speech and Audio Embeddings for Speech Emotion Recognition [0.0]
Speech Emotion Recognition (SER) has a wide range of applications, including dynamic analysis of customer calls, mental health assessment, and personalized language learning.
Pre-trained models (PTMs) have shown great promise in the speech and audio domain. Embeddings leveraged from these models serve as inputs for learning algorithms with applications in various downstream tasks.
We perform an extensive empirical analysis with four speech emotion datasets (CREMA-D, TESS, SAVEE, Emo-DB) by training three algorithms on the derived embeddings.
The results of our study indicate that the best performance is achieved by algorithms trained on embeddings
arXiv Detail & Related papers (2023-04-22T19:56:35Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Masked Autoencoders As The Unified Learners For Pre-Trained Sentence Representation [77.47617360812023]
We extend the recently proposed MAE style pre-training strategy, RetroMAE, to support a wide variety of sentence representation tasks.
The first stage performs RetroMAE over generic corpora, like Wikipedia, BookCorpus, etc., from which the base model is learned.
The second stage takes place on domain-specific data, e.g., MS MARCO and NLI, where the base model is continually trained with RetroMAE and contrastive learning.
arXiv Detail & Related papers (2022-07-30T14:34:55Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Representative Subset Selection for Efficient Fine-Tuning in Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
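The momentum contrastive learning mentioned in the entry above relies on a key encoder that is updated as an exponential moving average of the query encoder rather than by backpropagation. Below is a minimal sketch of that momentum update; the tiny speaker encoder and the momentum constant are assumptions for illustration, and the cited paper adds prototypes and a semi-supervised extension on top of this mechanism.

```python
# Minimal sketch of a MoCo-style momentum update: the key encoder tracks the
# query encoder as an exponential moving average instead of receiving gradients.
# The toy encoder and momentum value are illustrative assumptions.
import copy
import torch

query_encoder = torch.nn.Sequential(        # stand-in speaker embedding network
    torch.nn.Linear(80, 256), torch.nn.ReLU(), torch.nn.Linear(256, 192)
)
key_encoder = copy.deepcopy(query_encoder)
for p in key_encoder.parameters():
    p.requires_grad = False                 # key encoder is never updated by backprop

momentum = 0.999

@torch.no_grad()
def momentum_update():
    """key = m * key + (1 - m) * query, applied parameter-wise after each optimizer step."""
    for q_p, k_p in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_p.data.mul_(momentum).add_(q_p.data, alpha=1.0 - momentum)

# In a training loop: compute the contrastive loss with queries from query_encoder
# and keys from key_encoder, step the optimizer, then call momentum_update().
momentum_update()
```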