Developing RNN-T Models Surpassing High-Performance Hybrid Models with
Customization Capability
- URL: http://arxiv.org/abs/2007.15188v1
- Date: Thu, 30 Jul 2020 02:35:20 GMT
- Title: Developing RNN-T Models Surpassing High-Performance Hybrid Models with
Customization Capability
- Authors: Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan
Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, and Yifan
Gong
- Abstract summary: Recurrent neural network transducer (RNN-T) is a promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.
We describe our recent development of RNN-T models with reduced GPU memory consumption during training.
We study how to customize RNN-T models to a new domain, which is important for deploying E2E models to practical scenarios.
- Score: 46.73349163361723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Because of its streaming nature, recurrent neural network transducer (RNN-T)
is a very promising end-to-end (E2E) model that may replace the popular hybrid
model for automatic speech recognition. In this paper, we describe our recent
development of RNN-T models with reduced GPU memory consumption during
training, better initialization strategy, and advanced encoder modeling with
future lookahead. When trained with Microsoft's 65 thousand hours of anonymized
training data, the developed RNN-T model surpasses a very well trained hybrid
model with both better recognition accuracy and lower latency. We further study
how to customize RNN-T models to a new domain, which is important for deploying
E2E models to practical scenarios. By comparing several methods leveraging
text-only data in the new domain, we found that updating RNN-T's prediction and
joint networks using text-to-speech audio generated from domain-specific text is the
most effective.
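As a rough illustration of the customization recipe described in the abstract, the sketch below freezes a toy transducer's encoder and updates only its prediction and joint networks on a batch standing in for TTS audio synthesized from domain-specific text. The module layout, dimensions, and dummy data are assumptions for illustration, not the paper's implementation.

```python
import torch
import torchaudio

class RNNT(torch.nn.Module):
    """Toy transducer; layout assumed for illustration."""
    def __init__(self, n_feats=80, n_tokens=500, hidden=256):
        super().__init__()
        self.encoder = torch.nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
        self.prediction = torch.nn.LSTM(n_tokens, hidden, batch_first=True)
        self.joint = torch.nn.Linear(2 * hidden, n_tokens + 1)  # +1 for blank

    def forward(self, feats, label_onehots):
        f, _ = self.encoder(feats)             # (B, T, H) acoustic states
        g, _ = self.prediction(label_onehots)  # (B, U+1, H) label states
        ft = f.unsqueeze(2).expand(-1, -1, g.size(1), -1)
        gt = g.unsqueeze(1).expand(-1, f.size(1), -1, -1)
        return self.joint(torch.cat([ft, gt], dim=-1))  # (B, T, U+1, V)

model = RNNT()
model.encoder.requires_grad_(False)            # acoustic encoder stays fixed
opt = torch.optim.Adam(
    list(model.prediction.parameters()) + list(model.joint.parameters()), lr=1e-4)

# Dummy stand-ins for one batch of TTS audio synthesized from domain text.
B, T, U = 4, 120, 12
feats = torch.randn(B, T, 80)
targets = torch.randint(1, 500, (B, U), dtype=torch.int32)
label_onehots = torch.nn.functional.one_hot(   # blank/SOS-prepended label inputs
    torch.cat([torch.zeros(B, 1, dtype=torch.long), targets.long()], 1), 500).float()
feat_lens = torch.full((B,), T, dtype=torch.int32)
target_lens = torch.full((B,), U, dtype=torch.int32)

logits = model(feats, label_onehots)
loss = torchaudio.functional.rnnt_loss(
    logits, targets, feat_lens, target_lens, blank=logits.size(-1) - 1)
opt.zero_grad(); loss.backward(); opt.step()   # only prediction/joint move
```

Freezing the encoder leaves the acoustic model untouched, so only the label-side components absorb the new-domain text.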
Related papers
- Online Evolutionary Neural Architecture Search for Multivariate Non-Stationary Time Series Forecasting [72.89994745876086]
This work presents the Online Neuro-Evolution-based Neural Architecture Search (ONE-NAS) algorithm.
ONE-NAS is a novel neural architecture search method capable of automatically designing and dynamically training recurrent neural networks (RNNs) for online forecasting tasks.
Results demonstrate that ONE-NAS outperforms traditional statistical time series forecasting methods.
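A schematic of the online neuro-evolution idea, heavily simplified: keep a population of small RNN forecasters, briefly train each on the newest window of the stream, and refill the population by mutating the best performers. ONE-NAS's actual mutation operators and training scheme are more elaborate; everything below (sizes, toy series, selection scheme) is illustrative.

```python
import random, torch

class Forecaster(torch.nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.rnn = torch.nn.LSTM(1, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, 1)
    def forward(self, x):                      # x: (B, T, 1)
        h, _ = self.rnn(x)
        return self.head(h[:, -1])             # one-step-ahead forecast

def evaluate(model, window):                   # brief online training burst
    x, y = window[:, :-1], window[:, -1]
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(5):
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def mutate(parent):                            # fresh weights, for simplicity
    return Forecaster(max(4, parent.rnn.hidden_size + random.choice([-4, 0, 4])))

population = [Forecaster(random.choice([8, 16, 32])) for _ in range(10)]
stream = torch.sin(torch.arange(0, 200.0, 0.1)).reshape(-1, 1)  # toy series
for step in range(0, len(stream) - 33, 16):    # sliding windows over the stream
    window = stream[step:step + 33].unsqueeze(0)   # (1, 33, 1): 32 inputs + 1 target
    scored = sorted(population, key=lambda m: evaluate(m, window))
    population = scored[:3] + [mutate(random.choice(scored[:3])) for _ in range(7)]
```

Real neuro-evolution systems typically inherit parent weights rather than re-initializing children as this sketch does.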
arXiv Detail & Related papers (2023-02-20T22:25:47Z)
- Adversarial Learning Networks: Source-free Unsupervised Domain Incremental Learning [0.0]
In a non-stationary environment, updating a DNN model requires parameter re-training or model fine-tuning.
We propose an unsupervised source-free method to update DNN classification models.
Unlike existing methods, our approach can update a DNN model incrementally for non-stationary source and target tasks without storing past training data.
arXiv Detail & Related papers (2023-01-28T02:16:13Z)
- A Likelihood Ratio based Domain Adaptation Method for E2E Models [10.510472957585646]
End-to-end (E2E) automatic speech recognition models like Recurrent Neural Networks Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants.
While E2E models are very effective at learning representations of the data they are trained on, their accuracy on unseen domains remains a challenging problem.
In this work, we explore a contextual biasing approach using a likelihood ratio that leverages text data sources to adapt the RNN-T model to new domains and entities.
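A minimal sketch of the density-ratio idea behind likelihood-ratio biasing, not the paper's exact formulation: each beam hypothesis is rescored with the log-ratio of a target-domain LM to a source-domain LM, so in-domain text shifts the E2E model's posterior. The unigram NGramLM class is a hypothetical stand-in for real language models.

```python
class NGramLM:                                  # hypothetical stand-in for a real LM
    def __init__(self, logprobs):               # token -> log-probability table
        self.logprobs = logprobs
    def logprob(self, tokens):                   # unigram scoring, for brevity
        return sum(self.logprobs.get(t, -10.0) for t in tokens)

def biased_score(asr_logprob, tokens, target_lm, source_lm, lam=0.4):
    """Rescore a beam hypothesis with the target/source LM density ratio."""
    return asr_logprob + lam * (target_lm.logprob(tokens) - source_lm.logprob(tokens))

source_lm = NGramLM({"play": -2.0, "music": -3.0})
target_lm = NGramLM({"play": -2.0, "music": -3.0, "metformin": -4.0})  # medical domain
print(biased_score(-7.5, ["play", "metformin"], target_lm, source_lm))
```

During beam search, candidate expansions would be ranked with biased_score(...) instead of the raw RNN-T score.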
arXiv Detail & Related papers (2022-01-10T21:22:39Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models and of statistical models against the roofline model and a refined roofline model.
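For context, the roofline baseline referenced above can be stated in a few lines: a layer's execution time is lower-bounded by either peak compute or peak memory bandwidth. ANNETTE's stacked models refine such bounds with benchmark-fitted per-kernel terms; the hardware constants and layer example below are made up.

```python
def roofline_time(flops, bytes_moved, peak_flops=1e12, peak_bw=25e9):
    """Lower-bound execution time in seconds: compute-bound or memory-bound."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Example: a 3x3 conv layer, 64 -> 64 channels, 56x56 output, fp32
flops = 2 * 64 * 64 * 3 * 3 * 56 * 56                 # multiply-accumulates x2
bytes_moved = 4 * (64 * 58 * 58 + 64 * 64 * 3 * 3 + 64 * 56 * 56)  # in + weights + out
print(f"lower-bound time: {roofline_time(flops, bytes_moved) * 1e6:.1f} us")
```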
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network [0.0]
We show that RNN-transducer models can be effectively adapted to new domains using only small amounts of textual data.
Across multiple ASR evaluation tasks, we show that this method can provide relative gains of 10-45% in target-task WER.
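One plausible way to realize such text-only adaptation, sketched under the assumption that the prediction network can be fine-tuned as a next-token language model with a temporary softmax head (names, shapes, and data are illustrative; the paper's exact recipe may differ):

```python
import torch

n_tokens, hidden = 500, 256
pred_net = torch.nn.LSTM(n_tokens, hidden, batch_first=True)  # stands in for the prediction network
lm_head = torch.nn.Linear(hidden, n_tokens)                   # temporary next-token head
opt = torch.optim.Adam(list(pred_net.parameters()) + list(lm_head.parameters()), lr=1e-4)

# One dummy batch of domain-text token ids; a real setup would stream
# tokenized in-domain sentences here.
tokens = torch.randint(0, n_tokens, (8, 20))
onehots = torch.nn.functional.one_hot(tokens, n_tokens).float()

h, _ = pred_net(onehots[:, :-1])               # states predicting each next token
loss = torch.nn.functional.cross_entropy(
    lm_head(h).reshape(-1, n_tokens), tokens[:, 1:].reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()
# afterwards the adapted pred_net weights are copied back into the
# transducer and lm_head is discarded
```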
arXiv Detail & Related papers (2021-04-22T15:21:41Z)
- Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining good performance with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
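A minimal two-core tensor-train linear layer, to illustrate how a TT factorization shrinks the dense layers that sit on top of CNN-TT's convolutional feature extractor. Shapes and ranks are illustrative; the paper's architecture uses its own configuration.

```python
import torch

class TTLinear2(torch.nn.Module):
    """Two-core tensor-train stand-in for a dense (m1*m2) -> (n1*n2) layer."""
    def __init__(self, m=(16, 16), n=(8, 8), rank=4):
        super().__init__()
        self.m, self.n = m, n
        self.g1 = torch.nn.Parameter(torch.randn(m[0], n[0], rank) * 0.1)
        self.g2 = torch.nn.Parameter(torch.randn(rank, m[1], n[1]) * 0.1)

    def forward(self, x):                        # x: (B, m1*m2)
        x = x.view(-1, *self.m)
        y = torch.einsum('bij,ikr,rjl->bkl', x, self.g1, self.g2)
        return y.reshape(-1, self.n[0] * self.n[1])

x = torch.randn(32, 256)
print(TTLinear2()(x).shape)   # torch.Size([32, 64])
# parameter count: 16*8*4 + 4*16*8 = 1024 vs. 256*64 = 16384 for a full matrix
```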
arXiv Detail & Related papers (2020-07-25T22:21:05Z)
- On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition [42.31610064372749]
There are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED.
In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes.
We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly-optimized hybrid model.
arXiv Detail & Related papers (2020-05-28T22:30:57Z)
- RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models.
We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
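A schematic of overlapping inference, simplified to midpoint stitching (the paper's dynamic merging is more careful): decode fixed-length overlapping segments independently and keep each word from the segment whose central region it falls in. decode_fn here is a dummy stand-in for a streaming decoder.

```python
def overlapping_decode(n_seconds, decode_fn, seg_len=30.0, overlap=5.0):
    """decode_fn(start, end) -> [(word, offset_in_segment), ...] (assumed interface)."""
    words, start = [], 0.0
    while start < n_seconds:
        hyp = decode_fn(start, min(start + seg_len, n_seconds))
        lo = 0.0 if start == 0.0 else overlap / 2          # skip the leading overlap half
        hi = seg_len - overlap / 2 if start + seg_len < n_seconds else seg_len
        words += [(w, start + t) for (w, t) in hyp if lo <= t < hi]
        start += seg_len - overlap
    return words

# Dummy decoder emitting one word per second, named by absolute time.
fake = lambda s, e: [(f"w{int(s + k)}", float(k)) for k in range(int(e - s))]
print(overlapping_decode(70.0, fake)[:5])
```

The midpoint rule guarantees every instant of audio is claimed by exactly one segment, so words in the overlaps are neither dropped nor duplicated.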
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
- Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition [39.497407288772386]
The recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively.
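Encoder pre-training with external alignments, as a hedged sketch: before transducer training, fit the encoder as a frame-level classifier against alignments from an existing system (e.g., senone or word-piece targets), then discard the classification head. Dimensions and dummy data are illustrative.

```python
import torch

n_feats, hidden, n_targets = 80, 256, 4000     # e.g. senone or word-piece targets
encoder = torch.nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
frame_head = torch.nn.Linear(hidden, n_targets)  # discarded after pre-training
opt = torch.optim.Adam(list(encoder.parameters()) + list(frame_head.parameters()), lr=1e-4)

# Dummy batch; real frame labels would come from a hybrid system's alignments.
feats = torch.randn(4, 200, n_feats)
align = torch.randint(0, n_targets, (4, 200))

h, _ = encoder(feats)
loss = torch.nn.functional.cross_entropy(
    frame_head(h).reshape(-1, n_targets), align.reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()
# the pre-trained encoder then initializes the RNN-T encoder
```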
arXiv Detail & Related papers (2020-05-01T19:00:57Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
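A sketch of the layer-wise alignment idea: match the neurons of one model to the other before averaging weights. Hard one-to-one Hungarian matching is used here as a simple stand-in for the paper's soft optimal-transport coupling.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_layer(wa, wb):
    """wa, wb: (out, in) weight matrices of the same layer in two models."""
    cost = np.linalg.norm(wa[:, None, :] - wb[None, :, :], axis=-1)  # pairwise neuron distance
    _, perm = linear_sum_assignment(cost)      # match each A-neuron to a B-neuron
    return 0.5 * (wa + wb[perm])               # average after alignment

wa, wb = np.random.randn(64, 128), np.random.randn(64, 128)
fused = fuse_layer(wa, wb)
# In a full network the permutation must also be applied to the next
# layer's input dimension, and matching proceeds layer by layer.
```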
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.