Adaptive Discounting of Implicit Language Models in RNN-Transducers
- URL: http://arxiv.org/abs/2203.02317v1
- Date: Mon, 21 Feb 2022 08:44:56 GMT
- Title: Adaptive Discounting of Implicit Language Models in RNN-Transducers
- Authors: Vinit Unni, Shreya Khare, Ashish Mittal, Preethi Jyothi, Sunita
Sarawagi and Samarth Bharadwaj
- Abstract summary: We show how a lightweight adaptive LM discounting technique can be used with any RNN-T architecture.
We obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.
- Score: 33.63456351411599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: RNN-Transducer (RNN-T) models have become synonymous with streaming
end-to-end ASR systems. While they perform competitively on a number of
evaluation categories, rare words pose a serious challenge to RNN-T models. One
main reason for the degradation in performance on rare words is that the
language model (LM) internal to RNN-Ts can become overconfident and lead to
hallucinated predictions that are acoustically inconsistent with the underlying
speech. To address this issue, we propose a lightweight adaptive LM discounting
technique AdaptLMD, that can be used with any RNN-T architecture without
requiring any external resources or additional parameters. AdaptLMD uses a
two-pronged approach: 1) Randomly mask the prediction network output to
encourage the RNN-T not to be overly reliant on its outputs. 2) Dynamically
choose when to discount the implicit LM (ILM) based on rarity of recently
predicted tokens and divergence between ILM and implicit acoustic model (IAM)
scores. Comparing AdaptLMD to a competitive RNN-T baseline, we obtain up to 4%
and 14% relative reductions in overall WER and rare word PER, respectively, on
a conversational, code-mixed Hindi-English ASR task.
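The two prongs of AdaptLMD can be sketched in a few lines. This is a minimal illustration only: the function names, masking probability, rarity threshold, and divergence threshold below are assumptions for exposition, not values from the paper.

```python
import random

def mask_prediction_output(pred_out, mask_prob=0.3, rng=random):
    """Prong 1 (training): with some probability, zero out the prediction-
    network output so the joiner cannot lean solely on the implicit LM.
    mask_prob is an illustrative hyperparameter, not the paper's value."""
    if rng.random() < mask_prob:
        return [0.0] * len(pred_out)
    return pred_out

def should_discount_ilm(recent_token_freqs, ilm_score, iam_score,
                        rare_thresh=1e-4, div_thresh=2.0):
    """Prong 2 (decoding): trigger ILM discounting when recently predicted
    tokens are rare, or when ILM and IAM log-scores diverge sharply.
    Thresholds here are placeholders."""
    saw_rare_token = min(recent_token_freqs) < rare_thresh
    scores_diverge = abs(ilm_score - iam_score) > div_thresh
    return saw_rare_token or scores_diverge
```

In this sketch, `should_discount_ilm` would be consulted at each decoding step, and the ILM contribution down-weighted whenever it returns `True`.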
Related papers
- Advancing Regular Language Reasoning in Linear Recurrent Neural Networks [56.11830645258106]
We study whether linear recurrent neural networks (LRNNs) can learn the hidden rules in training sequences.
We propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix.
Experiments suggest that the proposed model is the only LRNN capable of performing length extrapolation on regular language tasks.
arXiv Detail & Related papers (2023-09-14T03:36:01Z) - Multi-blank Transducers for Speech Recognition [49.6154259349501]
In our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted.
We refer to the added symbols as big blanks, and to the method as multi-blank RNN-T.
With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods can bring relative speedups of over 90%/139%.
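A toy greedy-decoding loop can illustrate why big blanks speed up inference: emitting a blank that consumes several frames lets the decoder skip those frames in one step. The symbol names and durations below are illustrative, not from the paper.

```python
def decode_multi_blank(frame_symbols, blank_durations=None):
    """Greedy pass over per-frame best symbols: a big blank advances the
    frame index by its duration; any other symbol is emitted as output."""
    if blank_durations is None:
        # Illustrative big-blank inventory: each blank consumes N frames.
        blank_durations = {"<b1>": 1, "<b2>": 2, "<b4>": 4}
    t, out = 0, []
    while t < len(frame_symbols):
        sym = frame_symbols[t]
        if sym in blank_durations:
            t += blank_durations[sym]  # skip several frames at once
        else:
            out.append(sym)
            t += 1
    return out
```

Fewer loop iterations per utterance is the source of the reported inference speedup.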
arXiv Detail & Related papers (2022-11-04T16:24:46Z) - Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z) - Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on
Riemannian Gradient Descent With Illustrations of Speech Processing [74.31472195046099]
We exploit a low-rank tensor-train deep neural network (TT-DNN) to build an end-to-end deep learning pipeline, namely LR-TT-DNN.
A hybrid model combining LR-TT-DNN with a convolutional neural network (CNN) is set up to boost the performance.
Our empirical evidence demonstrates that the LR-TT-DNN and CNN+(LR-TT-DNN) models with fewer model parameters can outperform the TT-DNN and CNN+(TT-DNN) counterparts.
arXiv Detail & Related papers (2022-03-11T15:55:34Z) - On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method can get less than 50ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and testing conditions, caused by noise and other factors.
This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features.
arXiv Detail & Related papers (2020-11-02T13:50:59Z) - Exploring Pre-training with Alignments for RNN Transducer based
End-to-End Speech Recognition [39.497407288772386]
The recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training, and whole-network pre-training respectively.
arXiv Detail & Related papers (2020-05-01T19:00:57Z) - Distance and Equivalence between Finite State Machines and Recurrent
Neural Networks: Computational results [0.348097307252416]
We show some results related to the problem of extracting Finite State Machine based models from trained RNN Language models.
Our reduction technique from 3-SAT makes this latter fact easily generalizable to other RNN architectures.
arXiv Detail & Related papers (2020-04-01T14:48:59Z) - A Density Ratio Approach to Language Model Fusion in End-To-End
Automatic Speech Recognition [9.184319271887531]
This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR).
An RNN-T ASR model trained on paired audio & transcript data from YouTube is evaluated for its ability to generalize to Voice Search data.
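The density-ratio combination can be sketched as a per-token log-score: add the target-domain LM score and subtract the source-domain LM score to cancel the ASR model's implicit training-data prior. The function and weight names below are assumptions for illustration.

```python
def density_ratio_score(log_p_asr, log_p_target_lm, log_p_source_lm,
                        lam_target=0.5, lam_source=0.5):
    """Combine log-scores for one hypothesis token: the source-domain LM
    term is subtracted so the external target-domain LM can take its place.
    The interpolation weights are illustrative placeholders."""
    return (log_p_asr
            + lam_target * log_p_target_lm
            - lam_source * log_p_source_lm)
```

During beam search, each partial hypothesis would be rescored with this combined quantity instead of the raw ASR log-probability.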
arXiv Detail & Related papers (2020-02-26T02:53:42Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.