On the limit of English conversational speech recognition
- URL: http://arxiv.org/abs/2105.00982v1
- Date: Mon, 3 May 2021 16:32:38 GMT
- Title: On the limit of English conversational speech recognition
- Authors: Zoltán Tüske, George Saon, Brian Kingsbury
- Abstract summary: We show that a single headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition.
We reduce the recognition errors of our LSTM system on Switchboard-300 by 4% relative.
We report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In our previous work we demonstrated that a single headed attention
encoder-decoder model is able to reach state-of-the-art results in
conversational speech recognition. In this paper, we further improve the
results for both Switchboard 300 and 2000. Through use of an improved
optimizer, speaker vector embeddings, and alternative speech representations we
reduce the recognition errors of our LSTM system on Switchboard-300 by 4%
relative. Compensation of the decoder model with the probability ratio approach
allows more efficient integration of an external language model, and we report
5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM
models. Our study also considers the recently proposed conformer, and more
advanced self-attention based language models. Overall, the conformer shows
similar performance to the LSTM; nevertheless, their combination and decoding
with an improved LM reaches a new record on Switchboard-300, 5.0% and 10.0% WER
on SWB and CHM. Our findings are also confirmed on Switchboard-2000, and a new
state of the art is reported, practically reaching the limit of the benchmark.
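The probability-ratio approach mentioned in the abstract scores each beam-search hypothesis by combining the end-to-end model's log-probability with an external LM while subtracting a source-domain LM score that approximates the decoder's internal language model. A minimal sketch of that scoring rule follows; the function name and the fusion weights are illustrative placeholders, not the paper's tuned values.

```python
def fused_score(log_p_e2e, log_p_ext, log_p_src, lam_ext=0.6, lam_src=0.3):
    """Probability-ratio fusion of an external LM into an E2E decoder score.

    The source-domain LM log-probability (an estimate of the decoder's
    internal LM) is subtracted so that the external LM replaces, rather
    than stacks on top of, the implicit LM learned from the transcripts.
    Weights lam_ext and lam_src are illustrative, not tuned values.
    """
    return log_p_e2e + lam_ext * log_p_ext - lam_src * log_p_src

# Toy per-hypothesis scoring during beam search (all values are log-probs):
score = fused_score(log_p_e2e=-2.1, log_p_ext=-1.5, log_p_src=-1.8)
```

In practice the same rule is applied incrementally per emitted token, and the two interpolation weights are tuned on held-out data.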
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Analyzing And Improving Neural Speaker Embeddings for ASR [54.30093015525726]
We present our efforts to integrate neural speaker embeddings into a conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
arXiv Detail & Related papers (2023-01-11T16:56:03Z)
- Improving the Training Recipe for a Robust Conformer-based Hybrid Model [46.78701739177677]
We investigate various methods for speaker adaptive training (SAT) based on feature-space approaches for a conformer-based acoustic model (AM).
We propose a method, called weighted-Simple-Add, which adds weighted speaker information vectors to the input of the multi-head self-attention module of the conformer AM.
We extend and improve this recipe, achieving an 11% relative improvement in word error rate (WER) on the Switchboard 300h Hub5'00 dataset.
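The weighted-Simple-Add idea described above can be sketched as adding a scaled, projected speaker embedding to every frame entering the conformer's self-attention module. The projection matrix, dimensions, and weight below are hypothetical placeholders for illustration, not the paper's configuration.

```python
import numpy as np

def weighted_simple_add(frames, speaker_vec, weight=0.1, rng=None):
    """Sketch of weighted-Simple-Add: project a per-utterance speaker
    embedding to the model dimension and add a weighted copy of it to
    each frame of the self-attention input. Illustrative only."""
    rng = rng or np.random.default_rng(0)
    d_model = frames.shape[-1]
    # Random projection stands in for a learned linear layer.
    proj = rng.standard_normal((speaker_vec.shape[-1], d_model)) * 0.02
    return frames + weight * (speaker_vec @ proj)

# T=5 frames of a 256-dim self-attention input, one 128-dim speaker vector.
x = np.zeros((5, 256))
s = np.ones(128)
y = weighted_simple_add(x, s)
```

In a trained system the projection would be a learned layer and the weight either a tuned constant or a learned scalar.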
arXiv Detail & Related papers (2022-06-26T20:01:08Z)
- 4-bit Quantization of LSTM-based Speech Recognition Models [40.614677908909705]
We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition.
We show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations.
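As a rough illustration of what 4-bit weight quantization involves, the sketch below applies symmetric per-tensor fake quantization (round to a 4-bit integer grid, then rescale). This is a generic textbook quantizer, not the specific scheme chosen in the paper.

```python
import numpy as np

def quantize_4bit(w):
    """Symmetric per-tensor 4-bit fake quantization (illustrative).

    A signed 4-bit integer spans [-8, 7]; each weight is snapped to the
    nearest representable level and mapped back to floating point.
    """
    scale = np.max(np.abs(w)) / 7.0
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale, scale

w = np.array([0.7, -0.35, 0.05, -0.7])
w_hat, scale = quantize_4bit(w)
```

Production recipes typically refine this with per-channel scales and careful initialization of the quantizer ranges, which is the "appropriate choice of quantizers and initializations" the summary refers to.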
arXiv Detail & Related papers (2021-08-27T00:59:52Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Advancing RNN Transducer Technology for Speech Recognition [25.265297366014277]
We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks.
The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe.
We report a 5.9% and 12.5% word error rate on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation and a 12.7% WER on the Mozilla CommonVoice Italian test set.
arXiv Detail & Related papers (2021-03-17T22:19:11Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard [36.06535394840605]
We show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database.
Using a cross-utterance language model, our single-pass speaker independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00.
arXiv Detail & Related papers (2020-01-20T22:03:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.