Alignment Restricted Streaming Recurrent Neural Network Transducer
- URL: http://arxiv.org/abs/2011.03072v1
- Date: Thu, 5 Nov 2020 19:38:54 GMT
- Title: Alignment Restricted Streaming Recurrent Neural Network Transducer
- Authors: Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le,
Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer
- Abstract summary: We propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models.
The Ar-RNN-T loss provides refined control over the trade-off between token emission delays and the Word Error Rate (WER).
The Ar-RNN-T models also improve downstream applications such as ASR end-pointing by guaranteeing token emissions within any given latency range.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is a growing interest in the speech community in developing Recurrent
Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR)
applications. RNN-T is trained with a loss function that does not enforce
temporal alignment of the training transcripts and audio. As a result, RNN-T
models built with unidirectional long short-term memory (LSTM) encoders tend
to wait for longer spans of input audio before streaming already-decoded ASR
tokens. In this work, we propose a modification to the RNN-T loss function and
develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text
alignment information to guide the loss computation. We compare the proposed
method with existing works, such as monotonic RNN-T, on LibriSpeech and
in-house datasets. We show that the Ar-RNN-T loss provides a refined control to
navigate the trade-offs between the token emission delays and the Word Error
Rate (WER). The Ar-RNN-T models also improve downstream applications such as
the ASR End-pointing by guaranteeing token emissions within any given range of
latency. Moreover, the Ar-RNN-T loss allows for bigger batch sizes and 4 times
higher throughput for our LSTM model architecture, enabling faster training and
convergence on GPUs.
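The core idea described in the abstract is to use external audio-text alignment information to restrict which emission points in the RNN-T lattice contribute to the loss. The sketch below is a minimal, hypothetical illustration (not the paper's implementation): it builds a boolean mask over the (T, U) lattice that allows token u to be emitted only within a buffer around its reference alignment frame; names such as `b_left`, `b_right`, and `alignment_restriction_mask` are assumptions introduced here for illustration.

```python
import numpy as np

def alignment_restriction_mask(token_frames, num_frames, b_left=0, b_right=15):
    """Build a (T, U) mask of allowed emission points, in the spirit of Ar-RNN-T.

    token_frames : reference alignment frame a_u for each of the U tokens
                   (e.g. obtained from a forced alignment).
    num_frames   : number of encoder output frames T.
    b_left/b_right : buffer (in frames) allowed before / after the reference
                     alignment; emitting token u outside [a_u - b_left, a_u + b_right]
                     is pruned from the loss lattice.
    """
    U = len(token_frames)
    mask = np.zeros((num_frames, U), dtype=bool)
    for u, a_u in enumerate(token_frames):
        lo = max(0, a_u - b_left)
        hi = min(num_frames - 1, a_u + b_right)
        mask[lo:hi + 1, u] = True
    return mask

# Example: 50 encoder frames, 4 tokens aligned at frames 5, 14, 27, 40.
mask = alignment_restriction_mask([5, 14, 27, 40], num_frames=50,
                                  b_left=5, b_right=10)
# In an alignment-restricted loss, emission log-probs where mask is False
# would be set to -inf before the forward-backward recursion, so only paths
# that emit each token near its reference alignment contribute to the loss.
print(mask.sum(axis=0))  # number of allowed frames per token
```

Because disallowed lattice cells never need to be evaluated or stored, such a restriction also shrinks the memory footprint of the loss computation, which is consistent with the abstract's claim of larger batch sizes and higher training throughput.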