A Streaming On-Device End-to-End Model Surpassing Server-Side
Conventional Model Quality and Latency
- URL: http://arxiv.org/abs/2003.12710v2
- Date: Fri, 1 May 2020 21:36:25 GMT
- Title: A Streaming On-Device End-to-End Model Surpassing Server-Side
Conventional Model Quality and Latency
- Authors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang,
Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen,
Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli
Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak,
David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirko Visontai,
Yonghui Wu, Yu Zhang, Ding Zhao
- Abstract summary: We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
- Score: 88.08721721440429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thus far, end-to-end (E2E) models have not been shown to outperform
state-of-the-art conventional models with respect to both quality, i.e., word
error rate (WER), and latency, i.e., the time the hypothesis is finalized after
the user stops speaking. In this paper, we develop a first-pass Recurrent
Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell
(LAS) rescorer that surpasses a conventional model in both quality and latency.
On the quality side, we incorporate a large number of utterances across varied
domains to increase acoustic diversity and the vocabulary seen by the model. We
also train with accented English speech to make the model more robust to
different pronunciations. In addition, given the increased amount of training
data, we explore a varied learning rate schedule. On the latency front, we
explore using the end-of-sentence decision emitted by the RNN-T model to close
the microphone, and also introduce various optimizations to improve the speed
of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and
latency tradeoff compared to a conventional model. For example, for the same
latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more
than 400-times smaller in model size.
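To make the two-pass setup concrete, below is a minimal control-flow sketch: a streaming first pass consumes frames, emits an n-best list, and uses its end-of-sentence decision to close the microphone; a non-streaming second pass then rescores the n-best and the scores are interpolated. The hypotheses and scoring functions are placeholders for illustration, not the paper's models.

```python
# Control-flow sketch of streaming first pass + second-pass rescoring.
# Scores and the end-of-sentence test are placeholders, not real models.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    first_pass_score: float      # log-prob from the streaming first pass
    second_pass_score: float = 0.0

def rnnt_first_pass(frames):
    """Streaming first pass: consume frames one at a time and stop as soon as an
    end-of-sentence (</s>) decision is emitted, which also closes the microphone."""
    nbest = [Hypothesis("turn on the lights", -1.2),
             Hypothesis("turn off the lights", -1.9)]    # placeholder n-best
    for t, frame in enumerate(frames):
        end_of_sentence = frame is None                  # placeholder </s> test
        if end_of_sentence:
            return nbest, t                              # close the mic at frame t
    return nbest, len(frames)

def second_pass_rescore(nbest, weight=0.5):
    """Non-streaming second pass: rescore the n-best and interpolate scores."""
    for hyp in nbest:
        hyp.second_pass_score = -0.1 * len(hyp.text.split())   # placeholder score
    return max(nbest,
               key=lambda h: (1 - weight) * h.first_pass_score
                             + weight * h.second_pass_score)

frames = [object(), object(), object(), None]            # None marks detected </s>
nbest, endpoint = rnnt_first_pass(frames)
best = second_pass_rescore(nbest)
print(f"endpoint at frame {endpoint}, final hypothesis: {best.text!r}")
```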
Related papers
- On Comparison of Encoders for Attention based End to End Speech
Recognition in Standalone and Rescoring Mode [1.7704011486040847]
Non-streaming models provide better performance as they look at the entire audio context.
We show that the Transformer model offers acceptable WER with the lowest latency requirements.
We highlight the importance of a CNN front-end with the Transformer architecture for achieving comparable word error rates (WER).
arXiv Detail & Related papers (2022-06-26T09:12:27Z)
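As an illustration of the CNN front-end plus Transformer encoder highlighted above, here is a minimal PyTorch sketch in which two stride-2 convolutions subsample the feature sequence before a Transformer encoder; the layer sizes are arbitrary and not taken from the cited paper.

```python
import torch
import torch.nn as nn

class ConvFrontEndTransformer(nn.Module):
    """Illustrative encoder: a small convolutional front-end that subsamples the
    feature sequence in time (4x) before a Transformer encoder."""
    def __init__(self, n_mels=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        freq = n_mels
        for _ in range(2):                    # frequency bins after each stride-2 conv
            freq = (freq - 1) // 2 + 1
        self.proj = nn.Linear(d_model * freq, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats):                 # feats: (batch, time, n_mels)
        x = self.conv(feats.unsqueeze(1))     # (batch, d_model, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.encoder(self.proj(x))     # (batch, time/4, d_model)

enc = ConvFrontEndTransformer()
print(enc(torch.randn(2, 100, 80)).shape)     # torch.Size([2, 25, 256])
```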
- Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR via Supernet [24.62661549442265]
We propose Omni-sparsity DNN, where a single neural network can be pruned to generate optimized models for a large range of model sizes.
Our results show great savings in training time and resources, with similar or better accuracy on LibriSpeech compared to individually pruned models.
arXiv Detail & Related papers (2021-10-15T20:28:27Z)
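A minimal sketch of the underlying idea that one trained network can yield several model sizes: derive magnitude-pruning masks at several sparsity targets from a single weight tensor. This illustrates pruning in general, not the paper's supernet training scheme.

```python
import torch

def magnitude_masks(weight, sparsities=(0.5, 0.7, 0.9)):
    """From one trained weight tensor, build a binary keep-mask per target
    sparsity by thresholding on weight magnitude."""
    flat = weight.abs().flatten()
    masks = {}
    for s in sparsities:
        k = int(s * flat.numel())                       # number of weights to drop
        threshold = flat.kthvalue(k).values if k > 0 else torch.tensor(0.0)
        masks[s] = (weight.abs() > threshold).float()
    return masks

w = torch.randn(512, 512)
for s, m in magnitude_masks(w).items():
    print(f"sparsity target {s:.0%}: kept {int(m.sum())}/{m.numel()} weights")
```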
- Multi-mode Transformer Transducer with Stochastic Future Context [53.005638503544866]
Multi-mode speech recognition models can process longer future context to achieve higher accuracy, and when the latency budget is not flexible, the model can still achieve reliable accuracy.
We show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
arXiv Detail & Related papers (2021-06-17T18:42:11Z)
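The stochastic future context idea can be sketched as sampling, at each training step, how many future frames the attention mask exposes, so one model learns to run under several latency budgets. The sampled values below are arbitrary and the cited paper's exact scheme may differ.

```python
import random
import torch

def context_mask(num_frames, right_context):
    """Boolean attention mask where frame t may attend to all past frames and up
    to `right_context` future frames (True = blocked, PyTorch attn_mask style)."""
    idx = torch.arange(num_frames)
    return idx[None, :] > (idx[:, None] + right_context)

# Sample the allowed right context per training step (0 = fully streaming).
for step in range(3):
    r = random.choice([0, 5, 20])
    mask = context_mask(10, r)
    print(f"step {step}: right context {r:2d} frames, blocked entries {int(mask.sum())}")
```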
- Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
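For intuition, a TDNN-style enhancer can be sketched as a stack of dilated 1-D convolutions over spectral frames that predicts a multiplicative mask; the configuration below is illustrative only and does not follow the cited paper.

```python
import torch
import torch.nn as nn

class TinyTDNN(nn.Module):
    """Illustrative time-delay network: dilated 1-D convolutions over feature
    frames predicting an enhancement mask applied to the noisy input."""
    def __init__(self, n_feats=257, hidden=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=4, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, n_feats, kernel_size=1),
        )

    def forward(self, noisy):                      # noisy: (batch, time, n_feats)
        mask = torch.sigmoid(self.layers(noisy.transpose(1, 2)))
        return (mask * noisy.transpose(1, 2)).transpose(1, 2)

model = TinyTDNN()
print(model(torch.rand(2, 100, 257)).shape)        # torch.Size([2, 100, 257])
```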
- Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability [46.73349163361723]
The recurrent neural network transducer (RNN-T) is a promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.
We describe our recent development of RNN-T models with reduced GPU memory consumption during training.
We study how to customize RNN-T models to a new domain, which is important for deploying E2E models to practical scenarios.
arXiv Detail & Related papers (2020-07-30T02:35:20Z)
- Phone Features Improve Speech Translation [69.54616570679343]
End-to-end models for speech translation (ST) more tightly couple speech recognition (ASR) and machine translation (MT).
We compare cascaded and end-to-end models across high, medium, and low-resource conditions, and show that cascades remain stronger baselines.
We show that these features improve both architectures, closing the gap between end-to-end models and cascades, and outperforming previous academic work -- by up to 9 BLEU on our low-resource setting.
arXiv Detail & Related papers (2020-05-27T22:05:10Z)
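One simple way such phone features can enter either architecture is to concatenate a per-frame phone embedding with the acoustic features before the encoder; the sketch below uses random placeholder labels and is not the cited paper's exact feature pipeline.

```python
import torch
import torch.nn as nn

# Frame-level acoustic features augmented with an embedding of the phone label
# aligned to each frame (labels here are random placeholders; a real system
# would take them from an ASR or alignment model).
n_phones, phone_dim, n_mels, frames = 44, 32, 80, 100
phone_embed = nn.Embedding(n_phones, phone_dim)

acoustic = torch.randn(1, frames, n_mels)              # filterbank features
phone_ids = torch.randint(0, n_phones, (1, frames))    # per-frame phone labels
augmented = torch.cat([acoustic, phone_embed(phone_ids)], dim=-1)
print(augmented.shape)                                 # torch.Size([1, 100, 112])
```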
- RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models.
We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
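Dynamic overlapping inference can be pictured as decoding a long utterance in overlapping windows and stitching the per-window hypotheses together; the helper below only illustrates the windowing, with placeholder window sizes, and is not the paper's algorithm.

```python
def overlapping_segments(num_frames, window=800, overlap=200):
    """Split a long utterance into overlapping windows (frame indices only).
    Each window would be decoded independently and the overlapped region used
    to align and merge the hypotheses."""
    segments, start = [], 0
    while start < num_frames:
        end = min(start + window, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start = end - overlap          # next window re-reads `overlap` frames
    return segments

print(overlapping_segments(2000))      # [(0, 800), (600, 1400), (1200, 2000)]
```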
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
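A minimal sketch of the deliberation step described above: the first-pass hypothesis is encoded with a bidirectional LSTM, and the second-pass attention reads from the concatenation of acoustic and hypothesis encodings; the dimensions are illustrative, not the cited paper's configuration.

```python
import torch
import torch.nn as nn

class DeliberationAttender(nn.Module):
    """Second-pass attention over both acoustic encodings and a bidirectionally
    encoded first-pass hypothesis."""
    def __init__(self, d_model=256, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.hyp_encoder = nn.LSTM(d_model, d_model // 2, batch_first=True,
                                   bidirectional=True)         # outputs d_model dims
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, decoder_state, acoustics, hyp_tokens):
        hyp_enc, _ = self.hyp_encoder(self.embed(hyp_tokens))  # (B, U, d_model)
        memory = torch.cat([acoustics, hyp_enc], dim=1)        # attend to both sources
        context, _ = self.attn(decoder_state, memory, memory)
        return context

layer = DeliberationAttender()
ctx = layer(decoder_state=torch.randn(2, 1, 256),              # one decoding step
            acoustics=torch.randn(2, 50, 256),                 # encoder outputs
            hyp_tokens=torch.randint(0, 1000, (2, 12)))        # first-pass hypothesis
print(ctx.shape)                                               # torch.Size([2, 1, 256])
```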
- Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks [23.88788382262305]
The temporal convolutional recurrent network (TCRN) is an end-to-end model that directly maps a noisy waveform to a clean waveform.
We show that our model is able to improve performance compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z)
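As a rough illustration of a waveform-in, waveform-out TCRN, the sketch below frames the signal with a strided convolution, applies an LSTM over the frames, and reconstructs samples with a transposed convolution; layer sizes are arbitrary and not taken from the cited paper.

```python
import torch
import torch.nn as nn

class TinyTCRN(nn.Module):
    """Illustrative temporal convolutional recurrent network: conv framing,
    recurrent modeling, transposed-conv reconstruction of the waveform."""
    def __init__(self, channels=64):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel_size=32, stride=16, padding=8)
        self.rnn = nn.LSTM(channels, channels, batch_first=True)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel_size=32, stride=16, padding=8)

    def forward(self, noisy):                              # noisy: (batch, samples)
        x = torch.relu(self.encoder(noisy.unsqueeze(1)))   # (batch, channels, frames)
        x, _ = self.rnn(x.transpose(1, 2))                 # (batch, frames, channels)
        return self.decoder(x.transpose(1, 2)).squeeze(1)  # (batch, samples)

model = TinyTCRN()
print(model(torch.randn(2, 16000)).shape)                  # torch.Size([2, 16000])
```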