HMM-Free Encoder Pre-Training for Streaming RNN Transducer
- URL: http://arxiv.org/abs/2104.10764v2
- Date: Fri, 11 Jun 2021 03:11:39 GMT
- Title: HMM-Free Encoder Pre-Training for Streaming RNN Transducer
- Authors: Lu Huang, Jingyu Sun, Yufeng Tang, Junfeng Hou, Jinkun Chen, Jun
Zhang, Zejun Ma
- Abstract summary: This work describes an encoder pre-training procedure that uses frame-wise labels to improve the training of a streaming recurrent neural network transducer (RNN-T) model.
To the best of our knowledge, this is the first work to simulate HMM-based frame-wise labels with a CTC model for pre-training.
- Score: 9.970995981222993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work describes an encoder pre-training procedure that uses frame-wise
labels to improve the training of a streaming recurrent neural network transducer
(RNN-T) model. A streaming RNN-T trained from scratch usually performs worse than a
non-streaming RNN-T. Although it is common to address this issue by pre-training
components of the RNN-T with other criteria or with frame-wise alignment guidance,
such an alignment is not easily available in an end-to-end setup. In this work, the
frame-wise alignment used to pre-train the streaming RNN-T's encoder is generated
without any HMM-based system, yielding an all-neural framework with HMM-free encoder
pre-training. This is achieved by expanding the spikes of a CTC model to their
left/right blank frames, and two expansion strategies are proposed. To the best of
our knowledge, this is the first work to simulate HMM-based frame-wise labels with a
CTC model for pre-training. Experiments on the LibriSpeech and MLS English tasks show
that, compared with random initialization, the proposed pre-training procedure
reduces the WER by 5%~11% relative and the emission latency by 60 ms. Moreover, the
method is lexicon-free, so it transfers easily to new languages that lack a manually
designed lexicon.
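The key step is turning the sparse, spike-like outputs of a CTC model into dense frame-wise labels by spreading each spike's label over the neighbouring blank frames. The snippet below is a minimal sketch of one such expansion, assuming per-frame argmax CTC outputs and a midpoint split between adjacent spikes; the function name and the splitting rule are illustrative and are not the paper's two exact strategies.

```python
import numpy as np

def expand_ctc_spikes(frame_argmax, blank_id=0):
    """Convert spiky CTC frame labels into dense frame-wise labels.

    frame_argmax: 1-D array of per-frame argmax token ids from a CTC model,
                  where most frames are `blank_id` and non-blank "spikes"
                  mark roughly where each token is emitted.
    Returns an array of the same length in which every frame carries a token
    id: each spike's label is spread over the surrounding blank frames,
    split at the midpoint between adjacent spikes (an assumed rule).
    """
    frame_argmax = np.asarray(frame_argmax)
    num_frames = len(frame_argmax)
    spike_pos = np.where(frame_argmax != blank_id)[0]
    if len(spike_pos) == 0:
        return np.full(num_frames, blank_id)

    dense = np.empty(num_frames, dtype=frame_argmax.dtype)
    # Boundaries between consecutive spikes: halfway points.
    boundaries = [0]
    for left, right in zip(spike_pos[:-1], spike_pos[1:]):
        boundaries.append((left + right + 1) // 2)
    boundaries.append(num_frames)

    for i, pos in enumerate(spike_pos):
        dense[boundaries[i]:boundaries[i + 1]] = frame_argmax[pos]
    return dense

# Toy example: blanks (0) around spikes for tokens 7 and 3.
print(expand_ctc_spikes([0, 0, 7, 0, 0, 0, 3, 0, 0]))
# -> [7 7 7 7 3 3 3 3 3]
```

The resulting dense labels can then supervise a frame-level cross-entropy loss on the streaming encoder before the full model is trained with the RNN-T loss.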
Related papers
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Multi-blank Transducers for Speech Recognition [49.6154259349501]
In our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted.
We refer to the added symbols as big blanks, and to the method as multi-blank RNN-T.
With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% (see the toy decoding sketch after this entry).
arXiv Detail & Related papers (2022-11-04T16:24:46Z)
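For context, the big-blank mechanism of the multi-blank RNN-T summarized above can be pictured with a toy greedy-decoding loop in which emitting a big blank advances the time index by several frames at once, which is where the speedup comes from. The symbol ids and the `joint`/`step_decoder` callables below are placeholders, not the authors' implementation.

```python
# Hypothetical greedy decoding loop illustrating "big blanks" that skip
# several encoder frames at once.
BLANK = 0          # standard blank: advance 1 frame
BIG_BLANK_2 = 1    # big blank: advance 2 frames
BIG_BLANK_4 = 2    # big blank: advance 4 frames
FRAME_ADVANCE = {BLANK: 1, BIG_BLANK_2: 2, BIG_BLANK_4: 4}

def greedy_decode(encoder_frames, joint, step_decoder, init_state):
    hyp, state, t = [], init_state, 0
    while t < len(encoder_frames):
        logits = joint(encoder_frames[t], state)
        sym = int(logits.argmax())
        if sym in FRAME_ADVANCE:
            # Any blank (normal or big) moves along the time axis,
            # skipping FRAME_ADVANCE[sym] frames without emitting text.
            t += FRAME_ADVANCE[sym]
        else:
            # A real token extends the hypothesis and updates the
            # prediction-network state; time does not advance.
            hyp.append(sym)
            state = step_decoder(sym, state)
    return hyp
```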
- Accelerating RNN-T Training and Inference Using CTC guidance [18.776997761704784]
The proposed method accelerates RNN-T inference by 2.2 times with similar or slightly better word error rates (WER).
arXiv Detail & Related papers (2022-10-29T03:39:18Z)
- Deliberation of Streaming RNN-Transducer by Non-autoregressive Decoding [21.978994865937786]
The method performs a few refinement steps, where each step shares a transformer decoder that attends to both text features and audio features.
We show that, conditioned on hypothesis alignments of a streaming RNN-T model, our method obtains significantly more accurate recognition results than the first-pass RNN-T.
arXiv Detail & Related papers (2021-12-01T01:34:28Z)
- Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers [67.688697838109]
This paper presents a novel method to train quantized RNNLMs from scratch using the alternating direction method of multipliers (ADMM).
Experiments on two tasks suggest the proposed ADMM quantization achieves a model size compression factor of up to 31 times over the full-precision baseline RNNLMs (a generic ADMM sketch follows this entry).
arXiv Detail & Related papers (2021-11-29T09:30:06Z)
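As a rough picture of the ADMM recipe summarized above, the sketch below alternates a gradient step on the task loss, a projection of the weights onto a fixed quantization grid, and a dual update that ties the two copies together. The toy least-squares loss, the ternary grid, and all names are assumptions for illustration only, not the paper's RNNLM training code.

```python
import numpy as np

def project_to_grid(w, levels):
    """Snap each weight to the nearest allowed quantization level."""
    levels = np.asarray(levels)
    return levels[np.abs(w[:, None] - levels[None, :]).argmin(axis=1)]

def admm_quantize(loss_grad, w, levels, rho=0.1, lr=0.05, steps=200):
    g = project_to_grid(w, levels)   # auxiliary quantized copy of the weights
    u = np.zeros_like(w)             # scaled dual variable
    for _ in range(steps):
        # w-step: descend the task loss plus the augmented-Lagrangian penalty.
        w = w - lr * (loss_grad(w) + rho * (w - g + u))
        # g-step: re-project the penalized weights onto the quantization grid.
        g = project_to_grid(w + u, levels)
        # Dual update: accumulate the remaining constraint violation.
        u = u + w - g
    return g

# Example: quantize the weights of a least-squares problem to {-1, 0, +1}.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 8)), rng.normal(size=50)
grad = lambda w: A.T @ (A @ w - b) / len(b)
w_q = admm_quantize(grad, rng.normal(size=8), levels=[-1.0, 0.0, 1.0])
print(w_q)  # all entries lie on the ternary grid
```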
- Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
- Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid Precoding [94.40747235081466]
We propose an end-to-end deep learning-based joint transceiver design algorithm for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems.
We develop a DNN architecture that maps the received pilots into feedback bits at the receiver, and then further maps the feedback bits into the hybrid precoder at the transmitter.
arXiv Detail & Related papers (2021-10-22T20:49:02Z)
- Spike-inspired Rank Coding for Fast and Accurate Recurrent Neural Networks [5.986408771459261]
Biological spiking neural networks (SNNs) can temporally encode information in their outputs, whereas artificial neural networks (ANNs) conventionally do not.
Here we show that temporal coding such as rank coding (RC) inspired by SNNs can also be applied to conventional ANNs such as LSTMs.
RC-training also significantly reduces time-to-insight during inference, with a minimal decrease in accuracy.
We demonstrate these in two toy problems of sequence classification, and in a temporally-encoded MNIST dataset where our RC model achieves 99.19% accuracy after the first input time-step.
arXiv Detail & Related papers (2021-10-06T15:51:38Z)
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves less than a 50 ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition [39.497407288772386]
The recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively (a minimal frame-level pre-training sketch follows this entry).
arXiv Detail & Related papers (2020-05-01T19:00:57Z)
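Both this alignment-seeded approach and the HMM-free procedure of the main paper reduce encoder pre-training to ordinary frame-level classification once per-frame labels exist, whether they come from an external alignment model or from expanded CTC spikes. Below is a minimal sketch of that pre-training step, assuming PyTorch; the module and argument names are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical frame-wise encoder pre-training step. `encoder` is the
# streaming encoder to be pre-trained (assumed to map features to a tensor
# of per-frame representations); `frame_labels` holds one token id per
# encoder output frame, obtained from an alignment or from expanded spikes.

def pretrain_step(encoder, projection, optimizer, features, frame_labels):
    """One frame-level cross-entropy update on the encoder.

    features:     (batch, time, feat_dim) acoustic features
    frame_labels: (batch, time') per-frame token ids, aligned with the
                  encoder's (possibly subsampled) output frames
    """
    optimizer.zero_grad()
    enc_out = encoder(features)                   # (batch, time', enc_dim)
    logits = projection(enc_out)                  # (batch, time', vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),      # flatten all frames
        frame_labels.reshape(-1),                 # one target per frame
    )
    loss.backward()
    optimizer.step()
    return loss.item()

# After pre-training, the projection layer is discarded and the encoder is
# plugged into the full RNN-T model, which is then trained with the RNN-T loss.
```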