Related papers: Alignment Restricted Streaming Recurrent Neural Network Transducer

Alignment Restricted Streaming Recurrent Neural Network Transducer

URL: http://arxiv.org/abs/2011.03072v1
Date: Thu, 5 Nov 2020 19:38:54 GMT
Title: Alignment Restricted Streaming Recurrent Neural Network Transducer
Authors: Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer
Abstract summary: We propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T models. The Ar-RNN-T loss provides a refined control to navigate the trade-offs between the token emission delays and the Word Error Rate (WER) The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency.
Score: 29.218353627837214
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short term memory (LSTM) encoders tend to wait for longer spans of input audio, before streaming already decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides a refined control to navigate the trade-offs between the token emission delays and the Word Error Rate (WER). The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency. Moreover, the Ar-RNN-T loss allows for bigger batch sizes and 4 times higher throughput for our LSTM model architecture, enabling faster training and convergence on GPUs.

Related papers

Accurate Mapping of RNNs on Neuromorphic Hardware with Adaptive Spiking Neurons [2.9410174624086025]
We present a $SigmaDelta$-low-pass RNN (lpRNN) for mapping rate-based RNNs to spiking neural networks (SNNs) An adaptive spiking neuron model encodes signals using $SigmaDelta$-modulation and enables precise mapping. We demonstrate the implementation of the lpRNN on Intel's neuromorphic research chip Loihi.
arXiv Detail & Related papers (2024-07-18T14:06:07Z)
CIF-T: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition [8.302549684364195]
We propose a novel model named CIF-Transducer (CIF-T) which incorporates the Continuous Integrate-and-Fire (CIF) mechanism with the RNN-T model to achieve efficient alignment. CIF-T achieves state-of-the-art results with lower computational overhead compared to RNN-T models.
arXiv Detail & Related papers (2023-07-26T11:59:14Z)
ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition [14.384132377946154]
We introduce a new streaming ASR model, ConvRNN-T, with a novel convolutional context consisting of local and global context encoders. We show ConvRNN-T outperforms RNN-T, Conformer, and ContextNet onspeech and in-house data. ConvRNN-T's superior accuracy along with its low footprint make it a promising candidate for on-device streaming ASR technologies.
arXiv Detail & Related papers (2022-09-29T15:33:41Z)
Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation [70.75043144299168]
Spiking Neural Network (SNN) is a promising energy-efficient AI model when implemented on neuromorphic hardware. It is a challenge to efficiently train SNNs due to their non-differentiability. We propose the Differentiation on Spike Representation (DSR) method, which could achieve high performance.
arXiv Detail & Related papers (2022-05-01T12:44:49Z)
Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on Riemannian Gradient Descent With Illustrations of Speech Processing [74.31472195046099]
We exploit a low-rank tensor-train deep neural network (TT-DNN) to build an end-to-end deep learning pipeline, namely LR-TT-DNN. A hybrid model combining LR-TT-DNN with a convolutional neural network (CNN) is set up to boost the performance. Our empirical evidence demonstrates that the LR-TT-DNN and CNN+(LR-TT-DNN) models with fewer model parameters can outperform the TT-DNN and CNN+(LR-TT-DNN) counterparts.
arXiv Detail & Related papers (2022-03-11T15:55:34Z)
Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid Precoding [94.40747235081466]
We propose an end-to-end deep learning-based joint transceiver design algorithm for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems. We develop a DNN architecture that maps the received pilots into feedback bits at the receiver, and then further maps the feedback bits into the hybrid precoder at the transmitter.
arXiv Detail & Related papers (2021-10-22T20:49:02Z)
Spike-inspired Rank Coding for Fast and Accurate Recurrent Neural Networks [5.986408771459261]
Biological spiking neural networks (SNNs) can temporally encode information in their outputs, whereas artificial neural networks (ANNs) conventionally do not. Here we show that temporal coding such as rank coding (RC) inspired by SNNs can also be applied to conventional ANNs such as LSTMs. RC-training also significantly reduces time-to-insight during inference, with a minimal decrease in accuracy. We demonstrate these in two toy problems of sequence classification, and in a temporally-encoded MNIST dataset where our RC model achieves 99.19% accuracy after the first input time-step
arXiv Detail & Related papers (2021-10-06T15:51:38Z)
Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning. To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper- parameters of state-of-the-art factored time delay neural networks (TDNNs) These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training. Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z)
Progressive Tandem Learning for Pattern Recognition with Deep Spiking Neural Networks [80.15411508088522]
Spiking neural networks (SNNs) have shown advantages over traditional artificial neural networks (ANNs) for low latency and high computational efficiency. We propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition.
arXiv Detail & Related papers (2020-07-02T15:38:44Z)
Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition [39.497407288772386]
recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research. In this work, we leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training, and whole-network pre-training respectively.
arXiv Detail & Related papers (2020-05-01T19:00:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.