On Addressing Practical Challenges for RNN-Transducer
- URL: http://arxiv.org/abs/2105.00858v1
- Date: Tue, 27 Apr 2021 23:31:43 GMT
- Title: On Addressing Practical Challenges for RNN-Transducer
- Authors: Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong
- Abstract summary: We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves less than 50 ms word timing difference on average.
- Score: 72.72132048437751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, several works are proposed to address practical challenges in
deploying an RNN Transducer (RNN-T) based speech recognition system. These
challenges are adapting a well-trained RNN-T model to a new domain without
collecting audio data, and obtaining time stamps and confidence scores at the
word level. The first challenge is solved with a splicing data method, which
concatenates speech segments extracted from the source-domain data. To get
the time stamps, a phone prediction branch that shares the encoder is added to
the RNN-T model for the purpose of forced alignment. Finally, we obtain
word-level confidence scores by utilizing several types of features calculated
during decoding and from the confusion network. Evaluated with Microsoft production
data, the splicing data adaptation method improves over the baseline and over
adaptation with the text-to-speech method by 58.03% and 15.25% relative word error
rate reduction, respectively. The proposed time-stamping method achieves less than
50 ms word timing difference on average while maintaining the recognition
accuracy of the RNN-T model. We also obtain high confidence annotation
performance with limited computation cost.
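To make the splicing idea concrete, here is a minimal sketch: it assumes a source-domain corpus with word-level forced alignments indexed as `word_segments`, and builds a waveform for a target-domain sentence by concatenating matching segments. The index structure, sample rate, and random segment selection are illustrative assumptions; the paper's exact splicing strategy may differ.

```python
# Hedged sketch of splicing-data adaptation: synthesize "audio" for new-domain
# text by concatenating word segments cut from force-aligned source-domain data.
import random
import numpy as np

SAMPLE_RATE = 16000  # assumed

# Assumed index: word -> list of waveform snippets extracted from
# force-aligned source-domain utterances (placeholders here).
word_segments: dict[str, list[np.ndarray]] = {
    "play": [np.zeros(int(0.3 * SAMPLE_RATE))],
    "music": [np.zeros(int(0.4 * SAMPLE_RATE))],
}

def splice_utterance(text: str) -> np.ndarray | None:
    """Concatenate randomly chosen per-word segments into one waveform."""
    pieces = []
    for word in text.lower().split():
        candidates = word_segments.get(word)
        if not candidates:
            return None  # word not covered by the source-domain corpus
        pieces.append(random.choice(candidates))
    return np.concatenate(pieces)

wav = splice_utterance("play music")  # paired with the target-domain text for adaptation
```

The spliced waveform is then paired with its target-domain transcript and used to fine-tune the RNN-T model in place of real in-domain recordings.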
Related papers
- Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the transformer architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
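A toy sketch of the reuse idea follows, using the position-wise feed-forward part of a transformer layer, which is applied per token and so can be cached exactly. The caching policy and shapes are illustrative assumptions, not the paper's algorithm.

```python
# Cache per-position feed-forward outputs; recompute only modified tokens.
import numpy as np

rng = np.random.default_rng(0)
D = 8
W1, W2 = rng.standard_normal((D, 4 * D)), rng.standard_normal((4 * D, D))

def ffn(x: np.ndarray) -> np.ndarray:
    return np.maximum(x @ W1, 0.0) @ W2

cache: dict[int, tuple[bytes, np.ndarray]] = {}  # position -> (input bytes, output)

def incremental_ffn(tokens: np.ndarray) -> np.ndarray:
    out = np.empty_like(tokens)
    for i, x in enumerate(tokens):
        key = x.tobytes()
        if i in cache and cache[i][0] == key:
            out[i] = cache[i][1]        # reuse: input at this position unchanged
        else:
            out[i] = ffn(x)             # recompute only modified positions
            cache[i] = (key, out[i].copy())
    return out

seq = rng.standard_normal((5, D))
y1 = incremental_ffn(seq)
seq[2] += 1.0                           # modify one token
y2 = incremental_ffn(seq)               # only position 2 is recomputed
```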
arXiv Detail & Related papers (2023-07-27T16:30:27Z)
- Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-To-End Automatic Speech Recognition [86.21889574126878]
We show how per-frame entropy values can be normalized and aggregated to obtain a confidence measure per unit and per word.
We evaluate the proposed confidence measures on LibriSpeech test sets, and show that they are up to 2 and 4 times better than confidence estimation based on the maximum per-frame probability.
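A minimal sketch of this style of confidence measure follows, assuming access to per-frame posteriors and a frame-to-word alignment. Normalizing by log(V) and aggregating by the mean are illustrative choices; the paper compares several normalization and aggregation schemes.

```python
# Entropy-based confidence: low per-frame entropy -> high confidence.
import numpy as np

def frame_confidence(posteriors: np.ndarray) -> np.ndarray:
    """posteriors: (T, V) per-frame distributions -> (T,) confidences in [0, 1]."""
    eps = 1e-12
    entropy = -np.sum(posteriors * np.log(posteriors + eps), axis=-1)
    max_entropy = np.log(posteriors.shape[-1])  # entropy of the uniform distribution
    return 1.0 - entropy / max_entropy          # 1 = peaked, 0 = uniform

def word_confidence(posteriors, word_spans):
    """word_spans: list of (start_frame, end_frame) pairs, one per word."""
    conf = frame_confidence(posteriors)
    return [float(conf[s:e].mean()) for s, e in word_spans]
```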
arXiv Detail & Related papers (2022-12-16T20:27:40Z)
- Attention-based Feature Compression for CNN Inference Offloading in Edge Computing [93.67044879636093]
This paper studies the computational offloading of CNN inference in device-edge co-inference systems.
We propose a novel autoencoder-based CNN architecture (AECNN) for effective feature extraction at the end device.
Experiments show that AECNN can compress the intermediate data by more than 256x with only about 4% accuracy loss.
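A hedged PyTorch sketch of the split-computing pattern follows: the device runs the CNN head plus a small encoder at the cut point, and the edge server decodes the compressed features and runs the tail. Layer sizes and the 1x1-conv bottleneck are illustrative, not the paper's AECNN configuration.

```python
# Split inference with a learned bottleneck at the device/edge boundary.
import torch
import torch.nn as nn

head = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())      # on device
encoder = nn.Conv2d(64, 4, 1)                                        # compress channels 64 -> 4
decoder = nn.Conv2d(4, 64, 1)                                        # on edge server
tail = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                     nn.Linear(128, 10))                              # on edge server

x = torch.randn(1, 3, 32, 32)
features = head(x)
compressed = encoder(features)        # transmit this tensor (16x fewer channels)
logits = tail(decoder(compressed))    # reconstruct features, finish inference
```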
arXiv Detail & Related papers (2022-11-24T18:10:01Z)
- Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification [14.197869575012925]
We propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor.
RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context.
Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average of 96.2% LID prediction accuracy while leaving the second-pass WER unchanged.
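A minimal sketch of attaching a per-frame language-ID head to a streaming encoder, trained jointly with ASR, is shown below. The stand-in LSTM encoder, dimensions, and loss weighting are assumptions; the paper integrates the predictor into a cascaded-encoder RNN-T.

```python
# Per-frame LID head on encoder outputs, trained jointly with the ASR loss.
import torch
import torch.nn as nn

D, NUM_LANGS = 256, 9
encoder = nn.LSTM(input_size=80, hidden_size=D, batch_first=True)  # stand-in causal encoder
lid_head = nn.Linear(D, NUM_LANGS)                                 # per-frame LID predictor

feats = torch.randn(2, 100, 80)                       # (batch, frames, features)
enc_out, _ = encoder(feats)
lid_logits = lid_head(enc_out)                        # (batch, frames, NUM_LANGS)
lid_targets = torch.zeros(2, 100, dtype=torch.long)   # utterance LID broadcast per frame
lid_loss = nn.functional.cross_entropy(
    lid_logits.reshape(-1, NUM_LANGS), lid_targets.reshape(-1))
# total_loss = rnnt_loss + lambda_lid * lid_loss   (joint objective; weight assumed)
```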
arXiv Detail & Related papers (2022-09-13T15:10:41Z)
- Improving the fusion of acoustic and text representations in RNN-T [35.43599666228086]
We propose to use gating, bilinear pooling, and a combination of them in the joint network to produce more expressive representations.
We show that the joint use of the proposed methods can result in 4%--5% relative word error rate reductions with only a few million extra parameters.
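A hedged sketch of the two fusion options follows, in place of the usual additive combination of the encoder vector f and prediction-network vector g in the joint network. Dimensions and the low-rank bilinear factorization are illustrative.

```python
# Two fusion options for the RNN-T joint network: gating and bilinear pooling.
import torch
import torch.nn as nn

D = 256

class GatedJoint(nn.Module):
    def __init__(self, d=D, vocab=1000):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)
        self.out = nn.Linear(d, vocab)

    def forward(self, f, g):
        a = torch.sigmoid(self.gate(torch.cat([f, g], dim=-1)))
        return self.out(torch.tanh(a * f + (1 - a) * g))   # gated fusion

class BilinearJoint(nn.Module):
    def __init__(self, d=D, rank=64, vocab=1000):
        super().__init__()
        self.U, self.V = nn.Linear(d, rank), nn.Linear(d, rank)  # low-rank bilinear pooling
        self.out = nn.Linear(rank, vocab)

    def forward(self, f, g):
        return self.out(torch.tanh(self.U(f) * self.V(g)))

f, g = torch.randn(4, D), torch.randn(4, D)
logits = GatedJoint()(f, g)
```

The gate lets each dimension weigh acoustic against text evidence per step, while bilinear pooling models multiplicative interactions between the two vectors; both add only a few million parameters at typical dimensions.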
arXiv Detail & Related papers (2022-01-25T11:20:50Z)
- Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
- Adaptive Nearest Neighbor Machine Translation [60.97183408140499]
kNN-MT combines pre-trained neural machine translation with token-level k-nearest-neighbor retrieval.
The traditional kNN algorithm retrieves the same number of nearest neighbors for each target token.
We propose Adaptive kNN-MT, which dynamically determines the number of neighbors k for each target token.
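A simplified numpy sketch of the adaptive-k idea follows: retrieve K_max neighbors once, score each candidate k with a light model over the retrieval distances, and mix the resulting kNN distributions. The meta-scorer here is a placeholder over per-k mean distances, not the paper's trained predictor.

```python
# Adaptive-k mixing of kNN distributions over the target vocabulary.
import numpy as np

def knn_distribution(neighbor_tokens, distances, k, vocab, temp=10.0):
    """Distribution over vocab from the k nearest neighbors."""
    p = np.zeros(vocab)
    w = np.exp(-distances[:k] / temp)
    for tok, wi in zip(neighbor_tokens[:k], w):
        p[tok] += wi
    return p / max(p.sum(), 1e-12)

def adaptive_knn_distribution(neighbor_tokens, distances, vocab,
                              k_choices=(1, 2, 4, 8)):
    # Placeholder meta-scorer: weight each k by the (negated) mean distance
    # of its neighbors; the paper learns this from the distance features.
    scores = np.array([-distances[:k].mean() for k in k_choices])
    weights = np.exp(scores) / np.exp(scores).sum()
    dists = [knn_distribution(neighbor_tokens, distances, k, vocab)
             for k in k_choices]
    return sum(w * d for w, d in zip(weights, dists))

p = adaptive_knn_distribution(np.array([5, 5, 7, 7, 9, 9, 9, 9]),
                              np.linspace(0.1, 2.0, 8), vocab=10)
```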
arXiv Detail & Related papers (2021-05-27T09:27:42Z)
- Optimize what matters: Training DNN-HMM Keyword Spotting Model Using End Metric [21.581361079189563]
Deep Neural Network--Hidden Markov Model (DNN-HMM) based methods have been successfully used for many always-on keyword spotting algorithms.
We present a novel end-to-end training strategy that learns the DNN parameters by optimizing for the detection score.
Our method does not require any change in the model architecture or the inference framework.
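A hedged PyTorch sketch of optimizing an end detection metric: compute a smooth per-utterance detection score (here, a soft max over frame posteriors) and apply a hinge loss that pushes scores on keyword utterances above those on non-keyword utterances by a margin. The score definition and margin are assumptions, not the paper's exact formulation.

```python
# Train directly on the utterance-level detection score via a margin loss.
import torch

def detection_score(frame_logits: torch.Tensor) -> torch.Tensor:
    """frame_logits: (T,) keyword logits -> scalar utterance-level score."""
    return torch.logsumexp(frame_logits, dim=0)   # smooth, differentiable max

def end_metric_loss(pos_logits, neg_logits, margin=1.0):
    s_pos = detection_score(pos_logits)   # utterance containing the keyword
    s_neg = detection_score(neg_logits)   # utterance without the keyword
    return torch.relu(margin - (s_pos - s_neg))

loss = end_metric_loss(torch.randn(50, requires_grad=True),
                       torch.randn(50, requires_grad=True))
loss.backward()
```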
arXiv Detail & Related papers (2020-11-02T17:47:21Z)
- Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition [21.65651608697333]
We propose a novel and efficient minimum word error rate (MWER) training method for the RNN-Transducer (RNN-T).
In our proposed method, we re-calculate and sum the scores of all possible alignments for each hypothesis in the N-best list.
The hypothesis probability scores and back-propagated gradients are calculated efficiently using the forward-backward algorithm.
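A minimal sketch of the MWER objective over an N-best list follows: renormalize the hypothesis log-probabilities within the list and minimize the expected (mean-subtracted) word-error count. Computing each hypothesis's total log-probability by summing over all alignments with the forward-backward algorithm is abstracted away here as the `hyp_logprobs` input.

```python
# Expected word errors over an N-best list, with a mean baseline.
import torch

def mwer_loss(hyp_logprobs: torch.Tensor, word_errors: torch.Tensor) -> torch.Tensor:
    """hyp_logprobs: (N,) total log P(hypothesis); word_errors: (N,) edit counts."""
    p = torch.softmax(hyp_logprobs, dim=0)          # renormalize over the N-best
    relative = word_errors - word_errors.mean()     # baseline reduces gradient variance
    return torch.sum(p * relative)                  # expected relative word errors

loss = mwer_loss(torch.tensor([-3.2, -4.1, -5.0], requires_grad=True),
                 torch.tensor([1.0, 2.0, 4.0]))
loss.backward()
```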
arXiv Detail & Related papers (2020-07-27T18:33:35Z)
- Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition [39.497407288772386]
The recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively.
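A hedged sketch of the encoder pre-training variant follows: before RNN-T training, fit the encoder with a frame-level cross-entropy objective against labels from an external forced alignment, then discard the auxiliary output layer. Model sizes and the label inventory are illustrative.

```python
# Seed the encoder with frame-level alignment targets before RNN-T training.
import torch
import torch.nn as nn

D, NUM_UNITS = 256, 100
encoder = nn.LSTM(input_size=80, hidden_size=D, batch_first=True)
aux_out = nn.Linear(D, NUM_UNITS)   # temporary head, dropped after pre-training

feats = torch.randn(2, 100, 80)                       # (batch, frames, features)
align_labels = torch.randint(0, NUM_UNITS, (2, 100))  # frame labels from forced alignment
enc, _ = encoder(feats)
ce = nn.functional.cross_entropy(aux_out(enc).reshape(-1, NUM_UNITS),
                                 align_labels.reshape(-1))
ce.backward()   # pre-train the encoder; then plug it into the full RNN-T
```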
arXiv Detail & Related papers (2020-05-01T19:00:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.