Attention-based Transducer for Online Speech Recognition
- URL: http://arxiv.org/abs/2005.08497v1
- Date: Mon, 18 May 2020 07:26:33 GMT
- Title: Attention-based Transducer for Online Speech Recognition
- Authors: Bin Wang, Yan Yin, Hui Lin
- Abstract summary: We propose an attention-based transducer with modifications over RNN-T.
We introduce chunk-wise attention in the joint network and self-attention in the encoder.
Our proposed model outperforms RNN-T in both training speed and accuracy.
- Score: 11.308675771607753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies reveal the potential of the recurrent neural network transducer
(RNN-T) for end-to-end (E2E) speech recognition. Among the most popular E2E
systems, including RNN-T, Attention Encoder-Decoder (AED), and Connectionist
Temporal Classification (CTC), RNN-T has clear advantages in that it
supports streaming recognition and does not make a frame-independence assumption.
Although significant progress has been made in RNN-T research, it still
faces performance challenges in terms of training speed and accuracy. We
propose an attention-based transducer that modifies RNN-T in two aspects.
First, we introduce chunk-wise attention in the joint network. Second,
we introduce self-attention in the encoder. Our proposed model outperforms
RNN-T in both training speed and accuracy. For training, we achieve over a 1.7x
speedup. With 500 hours of LAIX non-native English training data, the attention-based
transducer yields a ~10.6% WER reduction over the baseline RNN-T. Trained with the full
set of over 10K hours of data, our final system achieves a ~5.5% WER reduction over
one trained with the best Kaldi TDNN-f recipe. After 8-bit weight quantization
without WER degradation, the real-time factor (RTF) and latency drop to 0.34–0.36 and 268–409
milliseconds, respectively, on a single CPU core of a production server.
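To make the first modification concrete, the sketch below shows one plausible way a joint network could attend over a chunk of encoder frames instead of consuming a single frame, as in the standard RNN-T joint network. The abstract does not give the exact formulation, so the additive-attention form, chunk size, and layer dimensions here are illustrative assumptions rather than the authors' published design.

```python
# Illustrative sketch only: the chunk-wise attention form below is assumed,
# not taken from the paper, which only states that attention is introduced
# in the joint network.
import torch
import torch.nn as nn


class ChunkwiseAttentionJoint(nn.Module):
    """Joint network that attends over a chunk of encoder frames instead of
    consuming a single encoder frame per output step."""

    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size, chunk_size=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.query = nn.Linear(pred_dim, joint_dim)   # query from prediction network
        self.key = nn.Linear(enc_dim, joint_dim)      # keys from the encoder chunk
        self.score = nn.Linear(joint_dim, 1)          # additive attention score
        self.out = nn.Sequential(
            nn.Linear(enc_dim + pred_dim, joint_dim),
            nn.Tanh(),
            nn.Linear(joint_dim, vocab_size),
        )

    def forward(self, enc_chunk, pred_out):
        # enc_chunk: (B, chunk_size, enc_dim) -- encoder frames for the current step
        # pred_out:  (B, U, pred_dim)         -- prediction-network outputs
        q = self.query(pred_out).unsqueeze(2)            # (B, U, 1, joint_dim)
        k = self.key(enc_chunk).unsqueeze(1)             # (B, 1, C, joint_dim)
        e = self.score(torch.tanh(q + k)).squeeze(-1)    # (B, U, C) attention logits
        a = torch.softmax(e, dim=-1)                     # attention weights over the chunk
        ctx = torch.einsum("buc,bcd->bud", a, enc_chunk) # (B, U, enc_dim) context
        joint_in = torch.cat([ctx, pred_out], dim=-1)
        return self.out(joint_in)                        # (B, U, vocab_size) logits
```

In a streaming setup, `enc_chunk` would hold the `chunk_size` most recent encoder frames available at the current emission step, so the attention window (and any latency it adds) stays bounded by the chunk size.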
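The abstract also reports deployment gains from 8-bit weight quantization. As a rough illustration of that kind of post-training step (the paper's actual scheme, scaling granularity, rounding mode, and inference kernels are not described here), a symmetric per-output-channel int8 weight quantizer might look like this:

```python
# Sketch of symmetric int8 weight quantization; the per-channel scaling
# choice here is an assumption, not the paper's documented scheme.
import torch


def quantize_weights_int8(weight: torch.Tensor):
    """Quantize a 2D weight matrix (out_features, in_features) to int8,
    returning the int8 tensor and the per-channel float scales so that
    weight ~= w_int8 * scale."""
    max_abs = weight.abs().amax(dim=1, keepdim=True)          # (out, 1)
    scale = (max_abs / 127.0).clamp(min=1e-8)                 # avoid divide-by-zero
    w_int8 = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return w_int8, scale


def dequantize(w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 weight from the int8 values and scales."""
    return w_int8.to(torch.float32) * scale
```

Keeping the float scales alongside the int8 weights lets inference either dequantize on the fly or fold the scales into integer matrix-multiply kernels, which is typically what drives RTF and latency reductions of the kind the abstract reports.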
Related papers
- On the Computational Complexity and Formal Hierarchy of Second Order
Recurrent Neural Networks [59.85314067235965]
We extend the theoretical foundation for the second-order recurrent network (2nd-order RNN).
We prove there exists a class of 2nd-order RNNs that is Turing-complete with bounded time.
We also demonstrate that 2nd-order RNNs, without memory, outperform modern-day models such as vanilla RNNs and gated recurrent units in recognizing regular grammars.
arXiv Detail & Related papers (2023-09-26T06:06:47Z) - A Time-to-first-spike Coding and Conversion Aware Training for
Energy-Efficient Deep Spiking Neural Network Processor Design [2.850312625505125]
We propose a conversion aware training (CAT) to reduce ANN-to-SNN conversion loss without hardware implementation overhead.
We also present a time-to-first-spike coding that allows lightweight logarithmic computation by utilizing spike time information.
The processor achieves top-1 accuracies of 91.7%, 67.9%, and 57.4% with inference energies of 486.7 uJ, 503.6 uJ, and 1426 uJ, respectively.
arXiv Detail & Related papers (2022-08-09T01:46:46Z) - Efficient Spiking Neural Networks with Radix Encoding [35.79325964767678]
Spiking neural networks (SNNs) have advantages in latency and energy efficiency over traditional artificial neural networks (ANNs).
In this paper, we propose a radix encoded SNN with ultra-short spike trains.
Experiments show that our method achieves a 25x speedup and a 1.1% accuracy improvement over the state-of-the-art work on the VGG-16 network architecture and the CIFAR-10 dataset.
arXiv Detail & Related papers (2021-05-14T16:35:53Z) - Deep Time Delay Neural Network for Speech Enhancement with Full Data
Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z) - Alignment Restricted Streaming Recurrent Neural Network Transducer [29.218353627837214]
We propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T models.
The Ar-RNN-T loss provides refined control to navigate the trade-off between token emission delays and the Word Error Rate (WER).
The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency.
arXiv Detail & Related papers (2020-11-05T19:38:54Z) - Kernel Based Progressive Distillation for Adder Neural Networks [71.731127378807]
Adder Neural Networks (ANNs), which contain only additions, offer a new way of developing deep neural networks with low energy consumption.
However, there is an accuracy drop when all convolution filters are replaced by adder filters.
We present a novel method for further improving the performance of ANNs without increasing the trainable parameters.
arXiv Detail & Related papers (2020-09-28T03:29:19Z) - FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, computational complexity of the ternary inner product can be reduced by a factor of 2.
We carefully design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z) - Tensor train decompositions on recurrent networks [60.334946204107446]
Matrix product state (MPS) tensor trains have more attractive features than matrix product operators (MPOs) in terms of storage reduction and computing time at inference.
Through theoretical analysis and practical experiments on an NLP task, we show that MPS tensor trains should be at the forefront of LSTM network compression.
arXiv Detail & Related papers (2020-06-09T18:25:39Z) - You Only Spike Once: Improving Energy-Efficient Neuromorphic Inference
to ANN-Level Accuracy [51.861168222799186]
Spiking Neural Networks (SNNs) are a type of neuromorphic, or brain-inspired network.
SNNs are sparse, accessing very few weights, and typically only use addition operations instead of the more power-intensive multiply-and-accumulate operations.
In this work, we aim to overcome the limitations of time-to-first-spike (TTFS)-encoded neuromorphic systems.
arXiv Detail & Related papers (2020-06-03T15:55:53Z) - Crossed-Time Delay Neural Network for Speaker Recognition [5.216353911330589]
We introduce a novel structure, the Crossed-Time Delay Neural Network (CTDNN), to enhance the performance of the current TDNN.
The proposed CTDNN gives significant improvements over the original TDNN on both speaker verification and identification tasks.
arXiv Detail & Related papers (2020-05-31T06:57:34Z) - Exploring Pre-training with Alignments for RNN Transducer based
End-to-End Speech Recognition [39.497407288772386]
The recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively.
arXiv Detail & Related papers (2020-05-01T19:00:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.