Accelerating RNN-T Training and Inference Using CTC guidance
- URL: http://arxiv.org/abs/2210.16481v1
- Date: Sat, 29 Oct 2022 03:39:18 GMT
- Title: Accelerating RNN-T Training and Inference Using CTC guidance
- Authors: Yongqiang Wang, Zhehuai Chen, Chengjian Zheng, Yu Zhang, Wei Han,
Parisa Haghani
- Abstract summary: The proposed method accelerates RNN-T inference by 2.2 times with similar or slightly better word error rates (WER).
- Score: 18.776997761704784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel method to accelerate the training and inference of a
recurrent neural network transducer (RNN-T) based on guidance from a
co-trained connectionist temporal classification (CTC) model. We make the key
assumption that if an encoder embedding frame is classified as a blank frame by
the CTC model, this frame is likely to be aligned to blank in all the partial
alignments or hypotheses of the RNN-T, so it can be discarded from the
decoder input. We also show that this frame reduction operation can be applied
in the middle of the encoder, which results in a significant speed-up for both
training and inference in the RNN-T. We further show that the CTC alignment, a
by-product of the CTC decoder, can also be used to perform lattice reduction
for the RNN-T during training. Our method is evaluated on the LibriSpeech and
SpeechStew tasks. We demonstrate that the proposed method accelerates RNN-T
inference by 2.2 times with similar or slightly better word error rates (WER).
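The core frame-reduction rule is easy to illustrate. Below is a minimal PyTorch sketch, not the authors' implementation: the function name, the 0.95 threshold, and the tensor shapes are all assumptions. Frames whose CTC blank posterior is high are dropped before the RNN-T decoder and joiner ever see them.

```python
import torch

def drop_blank_frames(encoder_out, ctc_logits, blank_id=0, threshold=0.95):
    """Discard encoder frames that a co-trained CTC head labels as blank.

    encoder_out: (T, D) encoder embeddings for one utterance.
    ctc_logits:  (T, V) CTC logits over the vocabulary (including blank).
    Frames whose blank posterior exceeds `threshold` are removed before
    they reach the RNN-T decoder, shrinking the effective sequence length.
    """
    posteriors = torch.softmax(ctc_logits, dim=-1)
    keep = posteriors[:, blank_id] < threshold  # True for likely non-blank frames
    return encoder_out[keep]

# Toy usage with random tensors standing in for real model outputs.
T, D, V = 100, 256, 1000
enc, ctc = torch.randn(T, D), torch.randn(T, V)
reduced = drop_blank_frames(enc, ctc)
print(enc.shape, "->", reduced.shape)
```

The same mask can also be applied to an intermediate encoder layer rather than the final one, which is what the abstract means by frame reduction in the middle of the encoder: every layer above the reduction point then processes fewer frames.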
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of context-biasing recognition with a simultaneous improvement in F-score and WER; a toy version of the CTC word-spotting step is sketched after this list.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- Blank-regularized CTC for Frame Skipping in Neural Transducer [33.08565763267876]
This paper proposes two novel regularization methods that explicitly encourage more blanks by constraining the self-loops of non-blank symbols in CTC.
Experiments on the LibriSpeech corpus show that the proposed method accelerates neural Transducer inference by 4 times without sacrificing performance; a simplified blank-encouraging regularizer is sketched after this list.
arXiv Detail & Related papers (2023-05-19T09:56:09Z)
- Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
- Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid Precoding [94.40747235081466]
We propose an end-to-end deep learning-based joint transceiver design algorithm for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems.
We develop a DNN architecture that maps the received pilots into feedback bits at the receiver, and then maps the feedback bits into the hybrid precoder at the transmitter; a generic pilots-to-bits-to-precoder pipeline is sketched after this list.
arXiv Detail & Related papers (2021-10-22T20:49:02Z)
- HMM-Free Encoder Pre-Training for Streaming RNN Transducer [9.970995981222993]
This work describes an encoder pre-training procedure that uses frame-wise labels to improve the training of a streaming recurrent neural network transducer (RNN-T) model.
To the best of our knowledge, this is the first work to simulate HMM-based frame-wise labels using a CTC model for pre-training; this idea is sketched after this list.
arXiv Detail & Related papers (2021-04-02T16:14:11Z)
- Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of an RNN transducer (RNN-T).
arXiv Detail & Related papers (2021-02-28T08:17:38Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective; a minimal version is sketched after this list.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
- AIN: Fast and Accurate Sequence Labeling with Approximate Inference Network [75.44925576268052]
The linear-chain Conditional Random Field (CRF) model is one of the most widely used neural sequence labeling approaches.
Exact probabilistic inference algorithms are typically applied in the training and prediction stages of the CRF model.
We propose to employ a parallelizable approximate variational inference algorithm for the CRF model; a minimal mean-field variant is sketched after this list.
arXiv Detail & Related papers (2020-09-17T12:18:43Z)
- Effect of Architectures and Training Methods on the Performance of Learned Video Frame Prediction [10.404162481860634]
Experimental results show that the residual FCNN architecture performs best in terms of peak signal-to-noise ratio (PSNR), at the expense of higher training and test (inference) computational complexity.
The CRNN can be trained stably and very efficiently using the stateful truncated backpropagation through time procedure (sketched after this list).
arXiv Detail & Related papers (2020-08-13T20:45:28Z)
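Code sketches for selected entries

For the context-biasing entry, the actual CTC-based Word Spotter matches CTC log-probabilities against a compact context graph with a beam-search-like procedure; the sketch below only conveys the flavor, using greedy CTC decoding and exact phrase matching over token ids. All names and ids are hypothetical.

```python
import torch

def greedy_ctc_decode(log_probs, blank_id=0):
    """Collapse the frame-wise argmax CTC path: merge repeats, drop blanks."""
    path = log_probs.argmax(dim=-1).tolist()
    out, prev = [], None
    for p in path:
        if p != prev and p != blank_id:
            out.append(p)
        prev = p
    return out

def spot_phrases(decoded, context_phrases):
    """Return every context phrase that appears as a contiguous run."""
    hits = []
    for phrase in context_phrases:
        n = len(phrase)
        if any(decoded[i:i + n] == phrase for i in range(len(decoded) - n + 1)):
            hits.append(phrase)
    return hits

# Toy example: integer ids stand in for word-piece units.
log_probs = torch.randn(50, 30).log_softmax(dim=-1)
print(spot_phrases(greedy_ctc_decode(log_probs), [[5, 7], [12]]))
```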
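The blank-regularized CTC entry constrains the self-loops of non-blank symbols inside the CTC topology itself; reproducing that faithfully requires modifying the CTC lattice. As a much simpler stand-in, the sketch below adds an auxiliary penalty on non-blank probability mass, which has a similar blank-encouraging effect. This is an illustrative surrogate, not the paper's method.

```python
import torch

def blank_encouraging_penalty(ctc_log_probs, blank_id=0):
    """Average non-blank probability mass per frame.

    Adding `lam * penalty` to the CTC loss nudges the model toward emitting
    blank on more frames, creating more frames a transducer can skip.
    """
    blank_prob = ctc_log_probs.exp()[..., blank_id]   # (N, T)
    return (1.0 - blank_prob).mean()

log_probs = torch.randn(4, 50, 30).log_softmax(dim=-1)  # (N, T, V)
print(float(blank_encouraging_penalty(log_probs)))
```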
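The two-timescale mmWave entry describes a receiver DNN that compresses pilots into feedback bits and a transmitter DNN that expands those bits into a hybrid precoder. The sketch below shows only this generic bits-in-the-middle pipeline with a straight-through estimator for the binarization; every dimension and layer size is made up, and the real system also models the channel and the two timescales.

```python
import torch
import torch.nn as nn

class PilotFeedback(nn.Module):
    """Receiver side: received pilots -> binary feedback bits."""
    def __init__(self, pilot_dim, num_bits):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pilot_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_bits))

    def forward(self, pilots):
        soft = torch.tanh(self.net(pilots))
        hard = torch.sign(soft)
        # Straight-through estimator: hard bits forward, soft gradients back.
        return soft + (hard - soft).detach()

class Precoder(nn.Module):
    """Transmitter side: feedback bits -> precoder parameters."""
    def __init__(self, num_bits, precoder_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_bits, 128), nn.ReLU(),
                                 nn.Linear(128, precoder_dim))

    def forward(self, bits):
        return self.net(bits)

pilots = torch.randn(8, 64)              # a batch of received pilot signals
bits = PilotFeedback(64, 16)(pilots)     # quantized feedback
w = Precoder(16, 32)(bits)               # precoder parameters
print(bits.shape, w.shape)
```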
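For the HMM-free pre-training entry: the idea is to give every encoder frame a hard label (possibly blank) taken from a trained CTC model, in place of an HMM forced alignment, and pre-train the encoder with cross-entropy. The sketch below uses the CTC argmax path as the pseudo-alignment; the GRU encoder, vocabulary size, and random "teacher" outputs are all placeholders.

```python
import torch
import torch.nn as nn

def frame_labels_from_ctc(ctc_log_probs):
    """Pseudo frame-wise labels from a CTC model's argmax path, shape (N, T)."""
    return ctc_log_probs.argmax(dim=-1)

encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
head = nn.Linear(256, 30)                             # 30 = toy vocab incl. blank

feats = torch.randn(4, 50, 80)                        # log-mel features
with torch.no_grad():
    teacher = torch.randn(4, 50, 30).log_softmax(-1)  # stands in for a CTC model
labels = frame_labels_from_ctc(teacher)

enc_out, _ = encoder(feats)
loss = nn.functional.cross_entropy(head(enc_out).reshape(-1, 30),
                                   labels.reshape(-1))
loss.backward()
print(float(loss))
```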
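The intermediate-loss entry attaches an auxiliary CTC head to an intermediate encoder layer and interpolates its loss with the final CTC loss. The sketch below is a minimal version with a stack of linear layers in place of a real encoder; the tap position and the 0.3 weight are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class InterCTCEncoder(nn.Module):
    """Toy encoder with an auxiliary CTC head tapped at layer `tap`."""
    def __init__(self, feat_dim=80, hidden=256, vocab=30, n_layers=6, tap=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(feat_dim if i == 0 else hidden, hidden)
             for i in range(n_layers)])
        self.tap = tap
        self.inter_head = nn.Linear(hidden, vocab)
        self.final_head = nn.Linear(hidden, vocab)

    def forward(self, x):
        inter = None
        for i, layer in enumerate(self.layers):
            x = torch.relu(layer(x))
            if i + 1 == self.tap:
                inter = self.inter_head(x).log_softmax(-1)
        return self.final_head(x).log_softmax(-1), inter

ctc = nn.CTCLoss(blank=0)
model = InterCTCEncoder()
x = torch.randn(4, 50, 80)                            # (N, T, features)
final_lp, inter_lp = model(x)
targets = torch.randint(1, 30, (4, 12))
in_lens, tgt_lens = torch.full((4,), 50), torch.full((4,), 12)
# Interpolate the final and intermediate CTC losses.
loss = (0.7 * ctc(final_lp.transpose(0, 1), targets, in_lens, tgt_lens)
        + 0.3 * ctc(inter_lp.transpose(0, 1), targets, in_lens, tgt_lens))
print(float(loss))
```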
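The AIN entry replaces exact CRF inference (forward-backward, Viterbi) with a parallelizable approximation. A mean-field update is one such approximation: every position repeatedly re-estimates its label distribution from its neighbors' current distributions, and all positions update at once. The sketch below is a minimal mean-field variant, not the paper's exact network.

```python
import torch

def mean_field_crf(unaries, transitions, n_iters=3):
    """Parallelizable mean-field inference for a linear-chain CRF.

    unaries:     (T, L) per-position label scores from a neural encoder.
    transitions: (L, L) score of moving from label i to label j.
    Unlike exact forward-backward, all T positions update in parallel,
    so each iteration costs O(1) sequential steps on a GPU.
    """
    q = torch.softmax(unaries, dim=-1)        # initialize from unary scores
    for _ in range(n_iters):
        left = torch.zeros_like(unaries)
        right = torch.zeros_like(unaries)
        left[1:] = q[:-1] @ transitions       # expected scores from position t-1
        right[:-1] = q[1:] @ transitions.T    # expected scores from position t+1
        q = torch.softmax(unaries + left + right, dim=-1)
    return q.argmax(dim=-1)

unaries = torch.randn(20, 9)                  # e.g. 9 BIO-style tags
transitions = torch.randn(9, 9)
print(mean_field_crf(unaries, transitions))
```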
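Finally, the stateful truncated backpropagation through time mentioned in the video-prediction entry carries the recurrent state across chunks of a long sequence while cutting the gradient at chunk boundaries. The sketch below shows the standard detach pattern on a toy GRU; the self-prediction loss and all sizes are placeholders.

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, 16)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()))

seq = torch.randn(1, 400, 16)                 # one long sequence
h = None
for chunk in seq.split(50, dim=1):            # truncation length 50
    out, h = rnn(chunk, h)
    loss = nn.functional.mse_loss(head(out), chunk)  # toy reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                            # keep the state, drop its history
print("carried hidden state:", h.shape)
```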
This list is automatically generated from the titles and abstracts of the papers on this site.