Pushing the Limits of Semi-Supervised Learning for Automatic Speech
Recognition
- URL: http://arxiv.org/abs/2010.10504v2
- Date: Wed, 20 Jul 2022 22:31:00 GMT
- Title: Pushing the Limits of Semi-Supervised Learning for Automatic Speech
Recognition
- Authors: Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu,
Ruoming Pang, Quoc V. Le and Yonghui Wu
- Abstract summary: We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech.
We carry out noisy student training with SpecAugment, using giant Conformer models pre-trained with wav2vec 2.0.
We achieve word error rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets, compared to the previous state-of-the-art WERs of 1.7%/3.3%.
- Score: 97.44056170380726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We employ a combination of recent developments in semi-supervised learning
for automatic speech recognition to obtain state-of-the-art results on
LibriSpeech, utilizing the unlabeled audio of the Libri-Light dataset. More
precisely, we carry out noisy student training with SpecAugment, using giant
Conformer models pre-trained with wav2vec 2.0. By doing so, we achieve word
error rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets,
compared to the previous state-of-the-art WERs of 1.7%/3.3%.
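As a rough illustration of the recipe, here is a minimal PyTorch sketch of one noisy-student generation: a fine-tuned teacher pseudo-labels unlabeled audio, and the student trains on SpecAugment-noised copies of the same utterances. Everything here (the `AcousticModel`, the `spec_augment` helper, and the frame-level cross-entropy standing in for the paper's sequence-level training) is a toy assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Toy stand-in for the paper's giant Conformer."""
    def __init__(self, feat_dim=80, vocab=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, vocab))

    def forward(self, x):                          # x: (batch, time, feat_dim)
        return self.net(x)

def spec_augment(feats, num_masks=2, width=8):
    """Crude frequency masking along the feature axis."""
    feats = feats.clone()
    for _ in range(num_masks):
        f0 = int(torch.randint(0, feats.size(-1) - width, (1,)))
        feats[..., f0:f0 + width] = 0.0
    return feats

teacher, student = AcousticModel(), AcousticModel()     # teacher: fine-tuned model
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
unlabeled = [torch.randn(4, 100, 80) for _ in range(3)]  # stand-in for Libri-Light

teacher.eval()
for feats in unlabeled:
    with torch.no_grad():
        pseudo = teacher(feats).argmax(-1)              # greedy pseudo-labels
    logits = student(spec_augment(feats))               # student sees noised input
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), pseudo.flatten())
    opt.zero_grad(); loss.backward(); opt.step()
```

In the paper, the student is additionally initialized from a wav2vec 2.0 pre-trained checkpoint, and the teacher/student roles are iterated over several generations.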
Related papers
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
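A minimal sketch of the switching idea, assuming a toy `encode` stand-in for the wav2vec 2.0 encoder and quantizer (not the authors' implementation): each view predicts its own quantized targets plus the other view's, encouraging noise-invariant representations.

```python
import torch
import torch.nn.functional as F

def contrastive(context, targets, temperature=0.1):
    """Cosine-similarity contrastive loss with in-batch negatives."""
    c = F.normalize(context, dim=-1)
    q = F.normalize(targets, dim=-1)
    logits = c @ q.t() / temperature                # (frames, frames)
    labels = torch.arange(logits.size(0))           # frame i matches target i
    return F.cross_entropy(logits, labels)

def encode(wave):
    """Toy stand-in for the wav2vec 2.0 encoder and quantizer."""
    feats = wave.unfold(0, 400, 320).mean(-1, keepdim=True).expand(-1, 64)
    return feats, feats.detach()                    # (context, quantized targets)

clean = torch.randn(16000)                          # 1 s of audio at 16 kHz
noisy = clean + 0.1 * torch.randn(16000)            # additive-noise view

c_clean, q_clean = encode(clean)
c_noisy, q_noisy = encode(noisy)

# Standard wav2vec 2.0 terms: each view predicts its own quantized targets ...
loss = contrastive(c_clean, q_clean) + contrastive(c_noisy, q_noisy)
# ... plus the switched terms: the clean context must predict the noisy
# targets and vice versa.
loss = loss + contrastive(c_clean, q_noisy) + contrastive(c_noisy, q_clean)
```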
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Injecting Text in Self-Supervised Speech Pretraining [33.676479965610774]
We propose to jointly learn representations during pretraining from two different modalities: speech and text.
tts4pretrain complements contrastive self-supervision with representations derived from speech synthesized from unpaired text.
We demonstrate relative Word Error Rate (WER) reductions of 10% on the well-benchmarked LibriSpeech task.
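A much-simplified sketch of this setup, with hypothetical `tts_synthesize` and `pretrain_step` placeholders (not the tts4pretrain code): unpaired text is rendered into speech-like features and mixed into the same self-supervised stream as real audio.

```python
import random
import torch

def tts_synthesize(text):
    """Stand-in TTS front end: fake log-mel features for a sentence."""
    return torch.randn(len(text) * 5, 80)

def pretrain_step(feats):
    """Stand-in for one contrastive pretraining update."""
    return feats.pow(2).mean()

real_audio = [torch.randn(400, 80) for _ in range(8)]    # unlabeled speech
unpaired_text = ["hello world", "speech and text", "libri light"]
stream = real_audio + [tts_synthesize(t) for t in unpaired_text]
random.shuffle(stream)

for feats in stream:                 # one shared loss over both modalities
    loss = pretrain_step(feats)
```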
arXiv Detail & Related papers (2021-08-27T11:36:40Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English LibriSpeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems from only two years ago that were trained on 960 hours of labeled data.
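A minimal GAN-style sketch of this adversarial setup; all modules, shapes, and data here are illustrative stand-ins rather than the released wav2vec-U implementation.

```python
import torch
import torch.nn as nn

N_PHONES, DIM = 40, 512
gen = nn.Linear(DIM, N_PHONES)                 # segment repr -> phoneme logits
disc = nn.Sequential(nn.Linear(N_PHONES, 128), nn.ReLU(), nn.Linear(128, 1))
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

segments = torch.randn(32, DIM)                # pooled self-supervised features
real = torch.eye(N_PHONES)[torch.randint(0, N_PHONES, (32,))]  # one-hot phonemes
                                               # from unpaired phonemized text

# Discriminator step: real phoneme sequences vs. generated distributions.
fake = gen(segments).softmax(-1)
d_loss = (bce(disc(real), torch.ones(32, 1)) +
          bce(disc(fake.detach()), torch.zeros(32, 1)))
d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# Generator step: produce distributions the discriminator accepts as real.
g_loss = bce(disc(gen(segments).softmax(-1)), torch.ones(32, 1))
g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```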
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- Pushing the Limits of Non-Autoregressive Speech Recognition [24.299771352483322]
We push the state of the art for non-autoregressive speech recognition on multiple datasets.
We leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training.
We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
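A minimal sketch of non-autoregressive CTC training, with a tiny linear encoder standing in for the giant Conformer and illustrative shapes throughout:

```python
import torch
import torch.nn as nn

VOCAB = 32                                     # includes the CTC blank at index 0
encoder = nn.Linear(80, VOCAB)                 # stand-in for the Conformer encoder
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 120, 80)                # (batch, time, feat)
targets = torch.randint(1, VOCAB, (4, 20))     # label ids; blanks never appear
in_lens = torch.full((4,), 120, dtype=torch.long)
tgt_lens = torch.full((4,), 20, dtype=torch.long)

log_probs = encoder(feats).log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
loss = ctc(log_probs, targets, in_lens, tgt_lens)
loss.backward()

# Decoding is a single parallel argmax over frames (no autoregression, no LM):
hyp = log_probs.argmax(-1).transpose(0, 1)     # then collapse repeats and blanks
```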
arXiv Detail & Related papers (2021-04-07T22:17:20Z)
- Exploring wav2vec 2.0 on speaker verification and language identification [9.047596226273495]
wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% on the 1-second condition and an EER of 3.47% on the full-length condition of the AP17-OLR dataset.
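A hedged sketch of how a pretrained encoder can be repurposed for speaker verification: mean-pool frame representations into an utterance embedding and score trial pairs by cosine similarity. The `encoder` below is a placeholder, not the actual wav2vec 2.0 model.

```python
import torch
import torch.nn.functional as F

encoder = torch.nn.Linear(80, 256)             # placeholder for wav2vec 2.0

def embed(feats):
    """Mean-pool frame features into a unit-norm utterance embedding."""
    return F.normalize(encoder(feats).mean(0), dim=-1)

enroll = embed(torch.randn(300, 80))           # enrollment utterance (toy feats)
test = embed(torch.randn(250, 80))             # test utterance
score = torch.dot(enroll, test)                # accept the trial if score > threshold
# Sweeping the threshold over many trials until the false-accept and
# false-reject rates are equal yields the Equal Error Rate (EER).
```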
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
- Self-training and Pre-training are Complementary for Speech Recognition [64.85342993297677]
Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data.
We show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups.
arXiv Detail & Related papers (2020-10-22T04:15:37Z)
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [51.25118580050847]
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods.
wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
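A minimal sketch of this masked contrastive objective: hide latent frames, run a context network over the masked sequence, and pick out the true quantized latent at each masked position among the other frames. Every module and dimension below is a toy stand-in for the real model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D = 50, 64
latents = torch.randn(T, D)                   # conv-encoder output (toy)
quantized = latents.clone()                   # stand-in for the quantizer output
context_net = nn.Linear(D, D)                 # stand-in for the Transformer

mask = torch.rand(T) < 0.5                    # mask ~50% of the time steps
masked_in = latents.clone()
masked_in[mask] = 0.0                         # masked frames are hidden
context = context_net(masked_in)

c = F.normalize(context[mask], dim=-1)        # predictions at masked steps
q = F.normalize(quantized, dim=-1)            # candidates: true latent + distractors
logits = c @ q.t() / 0.1                      # cosine similarities / temperature
labels = mask.nonzero(as_tuple=True)[0]       # the true latent is frame i itself
loss = F.cross_entropy(logits, labels)
loss.backward()                               # trains the context network
```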
arXiv Detail & Related papers (2020-06-20T02:35:02Z)
- Improved Noisy Student Training for Automatic Speech Recognition [89.8397907990268]
"Noisy student training" is an iterative self-training method that leverages augmentation to improve network performance.
We find effective methods to filter, balance, and augment the data generated between self-training iterations.
We improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
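A much-simplified sketch of filtering and balancing between generations: keep only pseudo-labeled utterances whose teacher confidence clears a threshold, then cap each length stratum. The thresholds, scores, and data are illustrative, not the paper's actual recipe.

```python
import torch

# Toy pool of pseudo-labeled utterances with teacher confidence scores.
utts = [{"id": i,
         "len": int(torch.randint(50, 500, (1,))),
         "conf": float(torch.rand(1))} for i in range(1000)]

kept = [u for u in utts if u["conf"] > 0.8]        # confidence filtering
kept.sort(key=lambda u: u["len"])
n = max(1, len(kept) // 4)                         # four length strata
buckets = [kept[i * n:(i + 1) * n] for i in range(4)]
balanced = [u for b in buckets for u in b[:50]]    # cap each stratum
print(len(balanced), "utterances kept for the next student generation")
```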
arXiv Detail & Related papers (2020-05-19T17:57:29Z)