Advancing RNN Transducer Technology for Speech Recognition
- URL: http://arxiv.org/abs/2103.09935v1
- Date: Wed, 17 Mar 2021 22:19:11 GMT
- Title: Advancing RNN Transducer Technology for Speech Recognition
- Authors: George Saon, Zoltan Tueske, Daniel Bolanos and Brian Kingsbury
- Abstract summary: We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks.
The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe.
We report a 5.9% and 12.5% word error rate on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation and a 12.7% WER on the Mozilla CommonVoice Italian test set.
- Score: 25.265297366014277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate a set of techniques for RNN Transducers (RNN-Ts) that were
instrumental in lowering the word error rate on three different tasks
(Switchboard 300 hours, conversational Spanish 780 hours and conversational
Italian 900 hours). The techniques pertain to architectural changes, speaker
adaptation, language model fusion, model combination and general training
recipe. First, we introduce a novel multiplicative integration of the encoder
and prediction network vectors in the joint network (as opposed to additive).
Second, we discuss the applicability of i-vector speaker adaptation to RNN-Ts
in conjunction with data perturbation. Third, we explore the effectiveness of
the recently proposed density ratio language model fusion for these tasks. Last
but not least, we describe the other components of our training recipe and
their effect on recognition performance. We report a 5.9% and 12.5% word error
rate on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation
and a 12.7% WER on the Mozilla CommonVoice Italian test set.
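The abstract's two most concrete techniques, multiplicative integration in the joint network and density-ratio language model fusion, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: function names, dimensions, weights, and fusion coefficients are hypothetical placeholders.

```python
import numpy as np

def joint_additive(enc, pred, W_enc, W_pred, b):
    """Standard RNN-T joint network: sum the projected encoder and
    prediction-network vectors, then apply a nonlinearity."""
    return np.tanh(W_enc @ enc + W_pred @ pred + b)

def joint_multiplicative(enc, pred, W_enc, W_pred, b):
    """Multiplicative integration (as opposed to additive): combine the
    two projections with an elementwise product before the nonlinearity."""
    return np.tanh((W_enc @ enc) * (W_pred @ pred) + b)

def density_ratio_score(log_p_asr, log_p_ext_lm, log_p_src_lm,
                        ext_weight=0.3, src_weight=0.3):
    """Density-ratio LM fusion: add a target-domain external LM score and
    subtract a source-domain LM score from the ASR log-probability.
    The weights are hypothetical; in practice they are tuned on held-out data."""
    return log_p_asr + ext_weight * log_p_ext_lm - src_weight * log_p_src_lm
```

In this sketch the multiplicative joint lets the prediction-network vector gate the encoder features elementwise, rather than merely shifting them, which is one intuition for why it can help.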
Related papers
- Employing Hybrid Deep Neural Networks on Dari Speech [0.0]
This article focuses on the recognition of individual words in the Dari language using the Mel-frequency cepstral coefficients (MFCCs) feature extraction method.
We evaluate three different deep neural network models: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Multilayer Perceptron (MLP).
arXiv Detail & Related papers (2023-05-04T23:10:53Z) - From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - Analyzing And Improving Neural Speaker Embeddings for ASR [54.30093015525726]
We present our efforts toward integrating neural speaker embeddings into a Conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
arXiv Detail & Related papers (2023-01-11T16:56:03Z) - On the limit of English conversational speech recognition [28.395662280898787]
We show that a single headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition.
We reduce the recognition errors of our LSTM system on Switchboard-300 by 4% relative.
We report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models.
arXiv Detail & Related papers (2021-05-03T16:32:38Z) - On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves less than 50 ms average word timing difference.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end work led to 12.5%, 27.5%, and 23.8% WER, a new performance milestone for the MGB2, MGB3, and MGB5 challenges, respectively.
Our results suggest that human performance on Arabic is still considerably better than machine performance, with an absolute WER gap of 3.6% on average.
arXiv Detail & Related papers (2021-01-21T05:55:29Z) - ASTRAL: Adversarial Trained LSTM-CNN for Named Entity Recognition [16.43239147870092]
We propose an Adversarial Trained LSTM-CNN (ASTRAL) system to improve the current NER method from both the model structure and the training process.
Our system is evaluated on three benchmarks, CoNLL-03, OntoNotes 5.0, and WNUT-17, achieving state-of-the-art results.
arXiv Detail & Related papers (2020-09-02T13:15:25Z) - Attention-based Transducer for Online Speech Recognition [11.308675771607753]
We propose an attention-based transducer with modifications over RNN-T.
We introduce chunk-wise attention in the joint network and self-attention in the encoder.
Our proposed model outperforms RNN-T in both training speed and accuracy.
arXiv Detail & Related papers (2020-05-18T07:26:33Z) - Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and Convolutional Neural Network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother.
arXiv Detail & Related papers (2020-05-16T20:56:25Z) - RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models.
We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.