Pushing the Limits of Non-Autoregressive Speech Recognition
- URL: http://arxiv.org/abs/2104.03416v1
- Date: Wed, 7 Apr 2021 22:17:20 GMT
- Title: Pushing the Limits of Non-Autoregressive Speech Recognition
- Authors: Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan
- Abstract summary: We push the limits of non-autoregressive state-of-the-art results for multiple datasets.
We leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training.
We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
- Score: 24.299771352483322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We apply recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard, and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on the LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% WER on the Wall Street Journal, all without a language model.
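Because the model is trained with CTC and decoded non-autoregressively, the Conformer encoder can emit its transcript in one parallel pass: take the highest-scoring label for every frame, collapse consecutive repeats, and drop blanks, with no language model or left-to-right decoder in the loop. Below is a minimal, hypothetical sketch of that greedy CTC collapse; the encoder, vocabulary, and blank index are placeholder assumptions, not the authors' code.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank symbol in the output vocabulary

def ctc_greedy_decode(log_probs: np.ndarray) -> list[int]:
    """log_probs: (frames, vocab) per-frame label scores from the acoustic encoder."""
    best = log_probs.argmax(axis=-1)          # best label per frame, computed in parallel
    decoded, prev = [], BLANK
    for label in best:
        if label != prev and label != BLANK:  # collapse repeats, then remove blanks
            decoded.append(int(label))
        prev = label
    return decoded

# e.g. a frame-wise argmax of [a, a, -, a, b, b] (with '-' the blank) collapses to [a, a, b]
```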
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition [0.0]
Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication.
We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment to train a multimodal model with a shared latent representation.
To the best of our knowledge, this work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER.
arXiv Detail & Related papers (2024-03-02T21:15:24Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
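As a rough illustration of the target-switching idea in Wav2vec-Switch above, the sketch below computes the two standard contrastive terms plus two cross terms with swapped quantized targets. It is a hypothetical simplification, not the authors' implementation: wav2vec 2.0's masking and sampled distractors are omitted, and every other frame of the same utterance simply acts as a distractor.

```python
import torch

def contrastive_loss(context: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Simplified InfoNCE-style loss: each frame's context vector should score highest
    # against its own (possibly swapped) quantized target; other frames are distractors.
    logits = context @ targets.transpose(1, 2)                       # (B, T, T) similarities
    labels = torch.arange(logits.size(1)).expand(logits.size(0), -1)
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

def switch_training_step(model, clean_wave: torch.Tensor, noisy_wave: torch.Tensor):
    # model(wave) is assumed to return (context_repr, quantized_targets),
    # each of shape (batch, frames, dim), as in wav2vec 2.0-style pre-training.
    c_clean, q_clean = model(clean_wave)
    c_noisy, q_noisy = model(noisy_wave)
    # Original contrastive task on each view of the utterance.
    loss = contrastive_loss(c_clean, q_clean) + contrastive_loss(c_noisy, q_noisy)
    # Switched targets: the clean context predicts the noisy targets and vice versa,
    # encouraging noise-invariant representations.
    loss = loss + contrastive_loss(c_clean, q_noisy) + contrastive_loss(c_noisy, q_clean)
    return loss
```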
- SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network [45.59907668722702]
We present SpeechStew, a speech recognition model that is trained on a combination of publicly available speech recognition datasets.
Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% on WSJ.
We also demonstrate that SpeechStew learns powerful transfer learning representations.
arXiv Detail & Related papers (2021-04-05T20:13:36Z)
- Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition [97.44056170380726]
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech.
We carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training.
We are able to achieve word error rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets, against the current state-of-the-art WERs of 1.7%/3.3%.
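For reference, SpecAugment, the augmentation used throughout these recipes, masks random frequency bands and time spans of the input log-mel spectrogram. The sketch below is a minimal, illustrative version with made-up mask sizes, not the exact policy from the paper.

```python
import numpy as np

def spec_augment(spec: np.ndarray, n_freq_masks: int = 2, max_f: int = 27,
                 n_time_masks: int = 2, max_t: int = 40) -> np.ndarray:
    """spec: (time_frames, mel_bins) log-mel spectrogram; returns a masked copy."""
    out = spec.copy()
    T, F = out.shape
    rng = np.random.default_rng()
    for _ in range(n_freq_masks):                 # frequency masking
        f = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, max(F - f, 0) + 1))
        out[:, f0:f0 + f] = 0.0
    for _ in range(n_time_masks):                 # time masking
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(T - t, 0) + 1))
        out[t0:t0 + t, :] = 0.0
    return out
```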
arXiv Detail & Related papers (2020-10-20T17:58:13Z)
- Improved Noisy Student Training for Automatic Speech Recognition [89.8397907990268]
"Noisy student training" is an iterative self-training method that leverages augmentation to improve network performance.
We find effective methods to filter, balance and augment the data generated in between self-training iterations.
We are able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
arXiv Detail & Related papers (2020-05-19T17:57:29Z)
- Leveraging End-to-End Speech Recognition with Neural Architecture Search [0.0]
We show that a large improvement in the accuracy of deep speech models can be achieved with effective Neural Architecture Optimization.
Our method achieves a test error of 7% word error rate (WER) on the LibriSpeech corpus and 13% phone error rate (PER) on the TIMIT corpus, on par with state-of-the-art results.
arXiv Detail & Related papers (2019-12-11T08:15:58Z)
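Every result above is quoted as word error rate (WER): the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal reference computation, included here only to make the metric concrete:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # sub / del / ins
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. wer("the cat sat", "the cat sat down") == 1/3, i.e. roughly 33% WER
```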