Joint Masked CPC and CTC Training for ASR
- URL: http://arxiv.org/abs/2011.00093v2
- Date: Sat, 13 Feb 2021 18:59:35 GMT
- Title: Joint Masked CPC and CTC Training for ASR
- Authors: Chaitanya Talnikar, Tatiana Likhomanenko, Ronan Collobert, Gabriel
Synnaeve
- Abstract summary: We demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data.
We show that this joint training method directly optimizes performance for the downstream ASR task using unsupervised data.
- Score: 29.41599824919278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) has shown promise in learning representations
of audio that are useful for automatic speech recognition (ASR). But, training
SSL models like wav2vec~2.0 requires a two-stage pipeline. In this paper we
demonstrate a single-stage training of ASR models that can utilize both
unlabeled and labeled data. During training, we alternately minimize two
losses: an unsupervised masked Contrastive Predictive Coding (CPC) loss and the
supervised audio-to-text alignment loss Connectionist Temporal Classification
(CTC). We show that this joint training method directly optimizes performance
for the downstream ASR task using unsupervised data while achieving similar
word error rates to wav2vec~2.0 on the Librispeech 100-hour dataset. Finally,
we postulate that solving the contrastive task is a regularization for the
supervised CTC loss.
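The alternating schedule described in the abstract can be sketched as follows. This is a toy illustration of the single-stage training loop only (placeholder loss functions and batches, no real encoder, optimizer, or CTC/CPC implementation), not the authors' code:

```python
# Minimal sketch of the alternating schedule: one unsupervised masked-CPC
# step on an unlabeled batch, then one supervised CTC step on a labeled
# batch, with both updates applied to the same shared encoder.

def train_alternating(step, unlabeled_batches, labeled_batches, cpc_loss, ctc_loss):
    """Interleave CPC and CTC updates; `step` applies one optimizer step."""
    log = []
    for u_batch, l_batch in zip(unlabeled_batches, labeled_batches):
        log.append(("cpc", step(cpc_loss, u_batch)))  # unsupervised update
        log.append(("ctc", step(ctc_loss, l_batch)))  # supervised update
    return log


# Toy usage: `step` just evaluates the loss; real code would backpropagate
# and update the encoder parameters between steps.
log = train_alternating(
    step=lambda loss, batch: loss(batch),
    unlabeled_batches=[10.0, 20.0],
    labeled_batches=[1.0, 2.0],
    cpc_loss=lambda b: 0.1 * b,
    ctc_loss=lambda b: 0.5 * b,
)
```

Because both losses update the same encoder, the contrastive steps can act as the regularizer for the supervised CTC loss that the abstract postulates.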
Related papers
- AsyCo: An Asymmetric Dual-task Co-training Model for Partial-label Learning [53.97072488455662]
Self-training models achieve state-of-the-art performance but suffer from the error-accumulation problem caused by mistakenly disambiguated instances.
We propose an asymmetric dual-task co-training model called AsyCo, which forces its two networks, i.e., a disambiguation network and an auxiliary network, to learn from different views explicitly.
Experiments on both uniform and instance-dependent partially labeled datasets demonstrate the effectiveness of AsyCo.
arXiv Detail & Related papers (2024-07-21T02:08:51Z) - Efficient infusion of self-supervised representations in Automatic Speech Recognition [1.2972104025246092]
Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks.
We propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model into the ASR architecture.
Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets.
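The framewise-addition idea above can be sketched in a few lines. This is an illustrative sketch with hypothetical names (pure Python, a toy linear projection standing in for a learned one), not the paper's actual API:

```python
# Sketch of framewise addition: project each SSL frame to the ASR hidden
# dimension, then add it element-wise to the corresponding ASR encoder frame.

def framewise_add(asr_frames, ssl_frames, project):
    """asr_frames and ssl_frames are frame-aligned lists of feature vectors."""
    assert len(asr_frames) == len(ssl_frames), "streams must be frame-aligned"
    return [
        [a + p for a, p in zip(asr_f, project(ssl_f))]
        for asr_f, ssl_f in zip(asr_frames, ssl_frames)
    ]


# Toy projection: a fixed linear map from 4-dim SSL frames to 2-dim ASR frames
# (a real system would learn this projection jointly with the ASR model).
def toy_projection(frame):
    w = [[0.5, 0.0, 0.5, 0.0],   # output dim 0
         [0.0, 0.5, 0.0, 0.5]]   # output dim 1
    return [sum(wi * x for wi, x in zip(row, frame)) for row in w]

fused = framewise_add(
    asr_frames=[[1.0, 2.0]],
    ssl_frames=[[2.0, 4.0, 2.0, 4.0]],
    project=toy_projection,
)
```

The cross-attention variant replaces the element-wise addition with an attention step in which the ASR frames attend over the SSL frames.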
arXiv Detail & Related papers (2024-04-19T05:01:12Z) - Learning from Flawed Data: Weakly Supervised Automatic Speech
Recognition [30.544499309503863]
Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data.
Human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models.
We propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties.
arXiv Detail & Related papers (2023-09-26T12:58:40Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows better recognition of speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - An Experimental Study on Private Aggregation of Teacher Ensemble
Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on private data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and perform the first experimental study on ASR to avoid acoustic data leakage.
arXiv Detail & Related papers (2022-10-11T16:55:54Z) - Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z) - Improving Hybrid CTC/Attention End-to-end Speech Recognition with
Pretrained Acoustic and Language Model [4.490054848527943]
We propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models.
To the best of our knowledge, this is the first work to utilize both pretrained AM and LM in an S2S ASR system.
arXiv Detail & Related papers (2021-12-14T09:38:31Z) - Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
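The auxiliary-loss idea can be sketched as a weighted sum of the CTC objective applied to the final encoder layer and to an intermediate layer. This is a hedged sketch with a placeholder standing in for a real CTC loss, and the weight value is illustrative, not taken from the paper:

```python
# Sketch of intermediate loss regularization: apply the CTC objective to an
# intermediate encoder layer as well as the final layer, and train on the
# weighted sum of the two losses.

def intermediate_ctc_loss(ctc, final_out, inter_out, targets, w=0.3):
    """Total loss = (1 - w) * CTC(final) + w * CTC(intermediate)."""
    return (1.0 - w) * ctc(final_out, targets) + w * ctc(inter_out, targets)


# Toy stand-in for CTC: absolute difference between scalar "outputs" and a
# scalar "target"; a real system would use a proper CTC loss over logits.
toy_ctc = lambda out, tgt: abs(out - tgt)
loss = intermediate_ctc_loss(toy_ctc, final_out=1.0, inter_out=3.0, targets=2.0)
```

The intermediate branch shares the output projection with the final layer, so the regularization adds no extra parameters at inference time.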
arXiv Detail & Related papers (2021-02-05T15:01:03Z) - Semi-Supervised Spoken Language Understanding via Self-Supervised Speech
and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.