Improved Noisy Student Training for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2005.09629v2
- Date: Thu, 29 Oct 2020 23:26:24 GMT
- Title: Improved Noisy Student Training for Automatic Speech Recognition
- Authors: Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li,
Yonghui Wu and Quoc V. Le
- Abstract summary: "Noisy student training" is an iterative self-training method that leverages augmentation to improve network performance.
We find effective methods to filter, balance and augment the data generated in between self-training iterations.
We are able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
- Score: 89.8397907990268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, a semi-supervised learning method known as "noisy student training"
has been shown to improve image classification performance of deep networks
significantly. Noisy student training is an iterative self-training method that
leverages augmentation to improve network performance. In this work, we adapt
and improve noisy student training for automatic speech recognition, employing
(adaptive) SpecAugment as the augmentation method. We find effective methods to
filter, balance and augment the data generated in between self-training
iterations. By doing so, we are able to obtain word error rates (WERs)
4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h
subset of LibriSpeech as the supervised set and the rest (860h) as the
unlabeled set. Furthermore, we are able to achieve WERs 1.7%/3.4% on the
clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight
as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the
previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h
(4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
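For intuition, the loop described in the abstract can be condensed into a short sketch. This is a minimal illustration under assumptions, not the authors' implementation: train_asr, spec_augment, balance, and the confidence score below are hypothetical placeholders for the paper's actual models, (adaptive) SpecAugment policy, and filtering/balancing recipes.

```python
# Minimal sketch of noisy student training for ASR, using hypothetical
# placeholder helpers; the real recipe is defined in the paper itself.

def train_asr(examples):
    """Placeholder: fit an ASR model on (features, transcript) pairs and
    return a decoder that yields (transcript, confidence_score)."""
    return lambda features: ("hypothesized transcript", 0.0)

def spec_augment(features):
    """Placeholder for (adaptive) SpecAugment: mask time/frequency blocks."""
    return features

def balance(pseudo_labeled):
    """Placeholder: sub-sample the generated data so its distribution
    matches the supervised set."""
    return pseudo_labeled

def noisy_student(labeled, unlabeled, generations=4, min_score=-10.0):
    model = train_asr(labeled)                      # generation-0 teacher
    for _ in range(generations):
        # 1. The current teacher transcribes the unlabeled audio.
        scored = [(x,) + model(x) for x in unlabeled]
        # 2. Filter low-confidence transcripts, then balance the generated set
        #    (these in-between-iteration steps are the paper's key additions).
        pseudo = balance([(x, y) for x, y, s in scored if s >= min_score])
        # 3. Train a student on supervised + pseudo-labeled data, with
        #    SpecAugment injecting noise into the student's training input.
        mixed = labeled + [(spec_augment(x), y) for x, y in pseudo]
        model = train_asr(mixed)                    # student becomes next teacher
    return model
```

In the paper's setups the supervised set is either the clean 100h subset (with the remaining 860h unlabeled) or LibriSpeech 960h (with the LibriLight unlab-60k subset unlabeled), and each generation's student serves as the teacher for the next round.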
Related papers
- Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN [41.88097793717185]
We propose a novel method named Multi-Discriminators CycleGAN to reduce the noise of input speech.
We show that training multiple generators on homogeneous subsets of the training data is better than training one generator on all the training data.
arXiv Detail & Related papers (2021-12-12T19:56:34Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach while using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
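As a rough illustration of the target-swapping step described in the Wav2vec-Switch entry above, here is a minimal sketch; the class and loss below are hypothetical placeholders (not the authors' code or the wav2vec 2.0 API) that only mirror the structure stated in the summary.

```python
# Hedged sketch of the Wav2vec-Switch idea with placeholder components.

class DummyWav2Vec2:
    """Placeholder standing in for a wav2vec 2.0-style encoder + quantizer."""
    def contextualize(self, waveform):   # masked contextual representations
        return [0.0] * 4
    def quantize(self, waveform):        # quantized (discrete) targets
        return [0.0] * 4

def contrastive_loss(context, targets):
    """Placeholder contrastive loss between context vectors and targets."""
    return sum((c - t) ** 2 for c, t in zip(context, targets))

def switch_loss(model, clean_wave, noisy_wave):
    # Feed the original-noisy pair through the same network simultaneously.
    c_clean, q_clean = model.contextualize(clean_wave), model.quantize(clean_wave)
    c_noisy, q_noisy = model.contextualize(noisy_wave), model.quantize(noisy_wave)
    # Existing contrastive task: each view predicts its own quantized targets.
    loss = contrastive_loss(c_clean, q_clean) + contrastive_loss(c_noisy, q_noisy)
    # The "switch": each view additionally predicts the other view's targets,
    # pushing the contextual representations to be noise-invariant.
    loss += contrastive_loss(c_clean, q_noisy) + contrastive_loss(c_noisy, q_clean)
    return loss
```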
- Injecting Text in Self-Supervised Speech Pretraining [33.676479965610774]
We propose to jointly learn representations during pretraining from two different modalities: speech and text.
tts4pretrain complements the power of contrastive learning in self-supervision.
We demonstrate relative Word Error Rate (WER) reductions of 10% on the well-benchmarked LibriSpeech task.
arXiv Detail & Related papers (2021-08-27T11:36:40Z)
- Multitask Training with Text Data for End-to-End Speech Recognition [45.35605825009208]
We propose a multitask training method for attention-based end-to-end speech recognition models.
We regularize the decoder in a listen, attend, and spell model by multitask training it on both audio-text and text-only data.
arXiv Detail & Related papers (2020-10-27T14:29:28Z)
- Self-training and Pre-training are Complementary for Speech Recognition [64.85342993297677]
Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data.
We show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups.
arXiv Detail & Related papers (2020-10-22T04:15:37Z)
- Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition [97.44056170380726]
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech.
We carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training.
We are able to achieve word error rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets, against the current state-of-the-art WERs of 1.7%/3.3%.
arXiv Detail & Related papers (2020-10-20T17:58:13Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
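A rough sketch of the data flow in the last entry (TTS-based augmentation) follows, again with hypothetical placeholder helpers rather than the authors' pipeline; whether the synthesized utterances reuse existing transcripts or add new text is not specified in the summary above, so the sketch simply takes the text to synthesize as an input.

```python
# Illustrative sketch of TTS-based data augmentation for ASR, using
# placeholder helpers; it shows only the data flow stated in the summary.

def train_tts(paired):
    """Placeholder: fit a TTS model on the (audio, text) pairs of the ASR set."""
    return lambda text: b"synthesized waveform"

def train_asr(paired):
    """Placeholder: fit an ASR model on (audio, text) pairs."""
    return lambda audio: "hypothesized transcript"

def tts_augmented_asr(asr_pairs, texts_to_synthesize):
    # 1. Build the TTS system on the ASR training database itself.
    tts = train_tts(asr_pairs)
    # 2. Extend the training data with synthesized speech.
    synthetic_pairs = [(tts(t), t) for t in texts_to_synthesize]
    # 3. Train the recognizer on real + synthetic utterances.
    return train_asr(asr_pairs + synthetic_pairs)
```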