Adaptive Discounting of Implicit Language Models in RNN-Transducers
- URL: http://arxiv.org/abs/2203.02317v1
- Date: Mon, 21 Feb 2022 08:44:56 GMT
- Title: Adaptive Discounting of Implicit Language Models in RNN-Transducers
- Authors: Vinit Unni, Shreya Khare, Ashish Mittal, Preethi Jyothi, Sunita
Sarawagi and Samarth Bharadwaj
- Abstract summary: We show how a lightweight adaptive LM discounting technique can be used with any RNN-T architecture.
We obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.
- Score: 33.63456351411599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: RNN-Transducer (RNN-T) models have become synonymous with streaming
end-to-end ASR systems. While they perform competitively on a number of
evaluation categories, rare words pose a serious challenge to RNN-T models. One
main reason for the degradation in performance on rare words is that the
language model (LM) internal to RNN-Ts can become overconfident and lead to
hallucinated predictions that are acoustically inconsistent with the underlying
speech. To address this issue, we propose a lightweight adaptive LM discounting
technique AdaptLMD, that can be used with any RNN-T architecture without
requiring any external resources or additional parameters. AdaptLMD uses a
two-pronged approach: 1) Randomly mask the prediction network output to
encourage the RNN-T not to be overly reliant on its outputs. 2) Dynamically
choose when to discount the implicit LM (ILM) based on rarity of recently
predicted tokens and divergence between ILM and implicit acoustic model (IAM)
scores. Comparing AdaptLMD to a competitive RNN-T baseline, we obtain up to 4%
and 14% relative reductions in overall WER and rare word PER, respectively, on
a conversational, code-mixed Hindi-English ASR task.
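The two prongs of AdaptLMD can be sketched in a few lines. This is a minimal illustration only: the function names, masking probability, rarity threshold, and divergence threshold below are assumptions for exposition, not values from the paper.

```python
import random

def mask_prediction_output(pred_out, mask_prob=0.3, rng=random):
    """Prong 1 (training): with some probability, zero out the prediction-
    network output so the joiner cannot lean solely on the implicit LM.
    mask_prob is an illustrative hyperparameter, not the paper's value."""
    if rng.random() < mask_prob:
        return [0.0] * len(pred_out)
    return pred_out

def should_discount_ilm(recent_token_freqs, ilm_score, iam_score,
                        rare_thresh=1e-4, div_thresh=2.0):
    """Prong 2 (decoding): trigger ILM discounting when recently predicted
    tokens are rare, or when ILM and IAM log-scores diverge sharply.
    Thresholds here are placeholders."""
    saw_rare_token = min(recent_token_freqs) < rare_thresh
    scores_diverge = abs(ilm_score - iam_score) > div_thresh
    return saw_rare_token or scores_diverge
```

In this sketch, `should_discount_ilm` would be consulted at each decoding step, and the ILM contribution down-weighted whenever it returns `True`.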
Related papers
- Advancing Regular Language Reasoning in Linear Recurrent Neural Networks [56.11830645258106]
We study whether linear recurrent neural networks (LRNNs) can learn the hidden rules in training sequences.
We propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix.
Experiments suggest that the proposed model is the only LRNN capable of performing length extrapolation on regular language tasks.
arXiv Detail & Related papers (2023-09-14T03:36:01Z) - Multi-blank Transducers for Speech Recognition [49.6154259349501]
In our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted.
We refer to the added symbols as big blanks, and to the method as multi-blank RNN-T.
With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods can bring relative speedups of over 90%/139%.
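A toy greedy-decoding loop can illustrate why big blanks speed up inference: emitting a blank that consumes several frames lets the decoder skip those frames in one step. The symbol names and durations below are illustrative, not from the paper.

```python
def decode_multi_blank(frame_symbols, blank_durations=None):
    """Greedy pass over per-frame best symbols: a big blank advances the
    frame index by its duration; any other symbol is emitted as output."""
    if blank_durations is None:
        # Illustrative big-blank inventory: each blank consumes N frames.
        blank_durations = {"<b1>": 1, "<b2>": 2, "<b4>": 4}
    t, out = 0, []
    while t < len(frame_symbols):
        sym = frame_symbols[t]
        if sym in blank_durations:
            t += blank_durations[sym]  # skip several frames at once
        else:
            out.append(sym)
            t += 1
    return out
```

Fewer loop iterations per utterance is the source of the reported inference speedup.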
arXiv Detail & Related papers (2022-11-04T16:24:46Z) - Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z) - Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on
Riemannian Gradient Descent With Illustrations of Speech Processing [74.31472195046099]
We exploit a low-rank tensor-train deep neural network (TT-DNN) to build an end-to-end deep learning pipeline, namely LR-TT-DNN.
A hybrid model combining LR-TT-DNN with a convolutional neural network (CNN) is set up to boost the performance.
Our empirical evidence demonstrates that the LR-TT-DNN and CNN+(LR-TT-DNN) models with fewer model parameters can outperform the TT-DNN and CNN+(TT-DNN) counterparts.
arXiv Detail & Related papers (2022-03-11T15:55:34Z) - On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method can get less than 50ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and testing conditions, caused by noise and other factors.
This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features.
arXiv Detail & Related papers (2020-11-02T13:50:59Z) - Exploring Pre-training with Alignments for RNN Transducer based
End-to-End Speech Recognition [39.497407288772386]
The recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training, and whole-network pre-training respectively.
arXiv Detail & Related papers (2020-05-01T19:00:57Z) - Distance and Equivalence between Finite State Machines and Recurrent
Neural Networks: Computational results [0.348097307252416]
We show some results related to the problem of extracting Finite State Machine based models from trained RNN Language models.
Our reduction technique from 3-SAT makes this latter fact easily generalizable to other RNN architectures.
arXiv Detail & Related papers (2020-04-01T14:48:59Z) - A Density Ratio Approach to Language Model Fusion in End-To-End
Automatic Speech Recognition [9.184319271887531]
This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR).
An RNN-T ASR model trained on paired audio & transcript data from YouTube is evaluated for its ability to generalize to Voice Search data.
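The density-ratio combination can be sketched as a per-token log-score: add the target-domain LM score and subtract the source-domain LM score to cancel the ASR model's implicit training-data prior. The function and weight names below are assumptions for illustration.

```python
def density_ratio_score(log_p_asr, log_p_target_lm, log_p_source_lm,
                        lam_target=0.5, lam_source=0.5):
    """Combine log-scores for one hypothesis token: the source-domain LM
    term is subtracted so the external target-domain LM can take its place.
    The interpolation weights are illustrative placeholders."""
    return (log_p_asr
            + lam_target * log_p_target_lm
            - lam_source * log_p_source_lm)
```

During beam search, each partial hypothesis would be rescored with this combined quantity instead of the raw ASR log-probability.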
arXiv Detail & Related papers (2020-02-26T02:53:42Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.