On the Comparison of Popular End-to-End Models for Large Scale Speech
Recognition
- URL: http://arxiv.org/abs/2005.14327v2
- Date: Thu, 30 Jul 2020 01:57:22 GMT
- Title: On the Comparison of Popular End-to-End Models for Large Scale Speech
Recognition
- Authors: Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu
- Abstract summary: There are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED.
In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes.
We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly-optimized hybrid model.
- Score: 42.31610064372749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been a strong push to transition from hybrid models to
end-to-end (E2E) models for automatic speech recognition. Currently, there are
three promising E2E methods: recurrent neural network transducer (RNN-T), RNN
attention-based encoder-decoder (AED), and Transformer-AED. In this study, we
conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models,
in both non-streaming and streaming modes. We use 65 thousand hours of
Microsoft anonymized training data to train these models. Because E2E models are
more data-hungry, it is better to compare their effectiveness with a large amount
of training data. To the best of our knowledge, no such comprehensive study has
been conducted yet. We show that although AED models are stronger than RNN-T in
non-streaming mode, RNN-T is very competitive in streaming mode if its encoder
can be properly initialized. Among all three E2E models, Transformer-AED achieved
the best accuracy in both streaming and non-streaming modes. We show that both
streaming RNN-T and Transformer-AED models can obtain better accuracy than a
highly-optimized hybrid model.
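As a minimal illustration of the model families compared above, the sketch below contrasts an RNN-T joint network (which combines encoder and prediction-network states over all time/label positions) with a single AED decoder step (which cross-attends over encoder states). The dimensions, module choices, and class names are illustrative assumptions, not the configurations used in the paper.

```python
import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    """RNN-T joint network: combine acoustic and label representations
    into logits over vocab + blank for every (time, label) pair."""
    def __init__(self, enc_dim=512, pred_dim=512, joint_dim=512, vocab=4000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab + 1)   # +1 for the blank symbol

    def forward(self, enc, pred):                    # enc: (B,T,De), pred: (B,U,Dp)
        joint = torch.tanh(self.enc_proj(enc).unsqueeze(2)
                           + self.pred_proj(pred).unsqueeze(1))   # (B,T,U,Dj)
        return self.out(joint)                       # (B,T,U,V+1)

class AEDDecoderStep(nn.Module):
    """One AED decoder step: attend over encoder states, then predict a token."""
    def __init__(self, enc_dim=512, dec_dim=512, vocab=4000, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dec_dim, heads,
                                          kdim=enc_dim, vdim=enc_dim,
                                          batch_first=True)
        self.out = nn.Linear(dec_dim, vocab)

    def forward(self, dec_state, enc):               # dec_state: (B,1,Dd), enc: (B,T,De)
        context, _ = self.attn(dec_state, enc, enc)  # cross-attention over encoder
        return self.out(context)                     # (B,1,V)
```

The structural difference shows up in the forward signatures: the RNN-T joint consumes frame-synchronous encoder states plus a label-synchronous prediction network, while the AED step conditions each output token on attention over the encoder output (restricted to already-seen frames when streaming).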
Related papers
- Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge
Distillation [6.617487928813374]
We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers.
We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of .483 mAP on AudioSet.
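As a rough sketch of the offline distillation objective described above (assuming precomputed teacher logits; the loss weighting and temperature are illustrative, not the paper's recipe):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Blend the hard-label loss with matching the frozen teacher's soft targets."""
    # Standard multi-label audio-tagging loss against ground-truth labels.
    hard = F.binary_cross_entropy_with_logits(student_logits, targets)
    # Offline KD: teacher logits are precomputed, so no teacher forward pass here.
    soft = F.binary_cross_entropy_with_logits(
        student_logits / temperature,
        torch.sigmoid(teacher_logits / temperature))
    return alpha * hard + (1.0 - alpha) * soft
```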
arXiv Detail & Related papers (2022-11-09T09:58:22Z)
- Towards Robust k-Nearest-Neighbor Machine Translation [72.9252395037097]
k-Nearest-Neighbor Machine Translation (kNN-MT) has become an important research direction in NMT in recent years.
Its main idea is to retrieve useful key-value pairs from an additional datastore to modify translations without updating the NMT model.
However, noisy retrieved pairs can dramatically degrade model performance.
We propose a confidence-enhanced kNN-MT model with robust training to alleviate the impact of noise.
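The retrieve-and-interpolate idea behind kNN-MT can be sketched as follows; the datastore layout, distance kernel, and fixed interpolation weight are generic kNN-MT conventions used for illustration, not this paper's confidence-enhanced formulation.

```python
import numpy as np

def knn_interpolate(p_nmt, keys, values, query, k=8, temperature=10.0, lam=0.3):
    """Blend the NMT distribution with a kNN distribution built from a datastore.

    keys:   (N, d) context representations stored offline
    values: (N,)   integer target-token ids paired with each key
    query:  (d,)   current decoder hidden state
    p_nmt:  (V,)   model's next-token distribution
    """
    # Retrieve the k nearest datastore entries by L2 distance.
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]
    # Turn negative distances into neighbour weights (softmax with temperature).
    w = np.exp(-dists[nn] / temperature)
    w /= w.sum()
    # Aggregate neighbour weights onto their target tokens.
    p_knn = np.zeros_like(p_nmt)
    np.add.at(p_knn, values[nn], w)
    # Fixed-weight interpolation; robust variants make lam depend on retrieval
    # confidence, which is the direction the paper above takes.
    return lam * p_knn + (1.0 - lam) * p_nmt
```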
arXiv Detail & Related papers (2022-10-17T07:43:39Z)
- A Likelihood Ratio based Domain Adaptation Method for E2E Models [10.510472957585646]
End-to-end (E2E) automatic speech recognition models like the Recurrent Neural Network Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants.
While E2E models are very effective at learning representations of the data they are trained on, their accuracy on unseen domains remains a challenging problem.
In this work, we explore a contextual biasing approach using likelihood ratios that leverages text data sources to adapt an RNN-T model to new domains and entities.
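A generic likelihood-ratio (density-ratio) fusion of this kind can be sketched as below; the interpolation weight and the use of separate source- and target-domain language models are common density-ratio assumptions, not necessarily the exact scoring used in the paper.

```python
def rescored_logprob(logp_e2e, logp_target_lm, logp_source_lm, lm_weight=0.3):
    """Adjust an E2E hypothesis score with a likelihood ratio of two LMs.

    logp_e2e:       log P_E2E(y | x) from the RNN-T model
    logp_target_lm: log P(y) under an LM trained on target-domain text
    logp_source_lm: log P(y) under an LM trained on the E2E training text
    The ratio term boosts hypotheses that are much more likely in the new
    domain than in the source domain.
    """
    return logp_e2e + lm_weight * (logp_target_lm - logp_source_lm)
```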
arXiv Detail & Related papers (2022-01-10T21:22:39Z)
- Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR via Supernet [24.62661549442265]
We propose Omni-sparsity DNN, where a single neural network can be pruned to generate optimized models for a wide range of model sizes.
Our results show great savings in training time and resources, with similar or better accuracy on LibriSpeech compared to individually pruned models.
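As a rough illustration of pruning one trained network to several target sparsities (the general idea behind a prune-once supernet; global magnitude pruning is used here purely for illustration and is not the paper's training scheme):

```python
import torch

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries to reach the given sparsity."""
    flat = weights.abs().flatten()
    k = int(sparsity * flat.numel())
    if k == 0:
        return weights.clone()
    threshold = torch.kthvalue(flat, k).values
    return torch.where(weights.abs() > threshold, weights,
                       torch.zeros_like(weights))

# One weight matrix, pruned to several sparsity levels for different devices.
w = torch.randn(512, 512)
variants = {s: magnitude_prune(w, s) for s in (0.5, 0.7, 0.9)}
for s, wp in variants.items():
    print(s, float((wp == 0).float().mean()))
```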
arXiv Detail & Related papers (2021-10-15T20:28:27Z)
- TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition [69.68154370877615]
Non-autoregressive (NAR) models remove the temporal dependency between output tokens and can predict all output tokens in one or a few parallel steps.
To address the problems of existing NAR models, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that TSNAT achieves performance competitive with the AR model and outperforms many complicated NAR models.
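To make the AR/NAR contrast concrete, here is a minimal decoding sketch; the model interfaces (callables mapping encoder states and a partial or placeholder target to logits) are illustrative assumptions, not TSNAT's architecture.

```python
import torch

def autoregressive_decode(step_fn, enc, max_len, bos=1, eos=2):
    """Generate tokens one at a time; each step depends on previous outputs."""
    ys = [bos]
    for _ in range(max_len):
        logits = step_fn(enc, torch.tensor([ys]))      # (1, t, V)
        nxt = int(logits[0, -1].argmax())
        ys.append(nxt)
        if nxt == eos:
            break
    return ys[1:]

def non_autoregressive_decode(parallel_fn, enc, out_len, mask_id=3):
    """Predict every position in a single parallel step from placeholder tokens."""
    placeholders = torch.full((1, out_len), mask_id)
    logits = parallel_fn(enc, placeholders)            # (1, out_len, V)
    return logits[0].argmax(dim=-1).tolist()
```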
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
- Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability [46.73349163361723]
Recurrent neural network transducer (RNN-T) is a promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.
We describe our recent development of RNN-T models with reduced GPU memory consumption during training.
We study how to customize RNN-T models to a new domain, which is important for deploying E2E models to practical scenarios.
arXiv Detail & Related papers (2020-07-30T02:35:20Z)
- RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models.
We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
- A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend and Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)
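As a minimal illustration of the two-pass setup in the last entry above, the sketch below re-ranks first-pass RNN-T hypotheses with a second-pass scorer; the scorer interface, weighting, and toy numbers are illustrative assumptions, not the paper's system.

```python
def rescore_nbest(hypotheses, second_pass_scorer, first_pass_weight=0.5):
    """Re-rank first-pass RNN-T hypotheses with a second-pass (e.g. LAS) scorer.

    hypotheses:         list of (text, rnnt_logprob) pairs from first-pass beam search
    second_pass_scorer: callable returning a second-pass log-probability for a text;
                        the name and interface here are hypothetical.
    """
    def combined(hyp):
        text, rnnt_logprob = hyp
        return (first_pass_weight * rnnt_logprob
                + (1.0 - first_pass_weight) * second_pass_scorer(text))
    return max(hypotheses, key=combined)

# Toy usage with a stand-in scorer that mildly prefers the longer hypothesis.
best = rescore_nbest([("hello word", -4.6), ("hello world", -4.5)],
                     second_pass_scorer=lambda t: -0.1 * abs(len(t) - 11))
print(best[0])  # "hello world"
```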