Deep Shallow Fusion for RNN-T Personalization
- URL: http://arxiv.org/abs/2011.07754v1
- Date: Mon, 16 Nov 2020 07:13:58 GMT
- Title: Deep Shallow Fusion for RNN-T Personalization
- Authors: Duc Le, Gil Keren, Julian Chan, Jay Mahadeokar, Christian Fuegen,
Michael L. Seltzer
- Abstract summary: We present novel techniques to improve RNN-T's ability to model rare WordPieces.
We show that these combined techniques result in 15.4%-34.5% relative Word Error Rate improvement.
- Score: 22.271012062526463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T)
in particular, have gained significant traction in the automatic speech
recognition community in the last few years due to their simplicity,
compactness, and excellent performance on generic transcription tasks. However,
these models are more challenging to personalize compared to traditional hybrid
systems due to the lack of external language models and difficulties in
recognizing rare long-tail words, specifically entity names. In this work, we
present novel techniques to improve RNN-T's ability to model rare WordPieces,
infuse extra information into the encoder, enable the use of alternative
graphemic pronunciations, and perform deep fusion with personalized language
models for more robust biasing. We show that these combined techniques result
in 15.4%-34.5% relative Word Error Rate improvement compared to a strong RNN-T
baseline which uses shallow fusion and text-to-speech augmentation. Our work
helps push the boundary of RNN-T personalization and close the gap with hybrid
systems on use cases where biasing and entity recognition are crucial.
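As a rough illustration of the shallow-fusion baseline mentioned in the abstract, the sketch below interpolates RNN-T token log-probabilities with those of a personalized language model during beam search; the function, candidate tokens, and the 0.3 weight are illustrative assumptions, not values from the paper.

```python
def shallow_fusion_score(rnnt_logp, lm_logp, lm_weight=0.3):
    """Generic shallow fusion: log-linear interpolation of RNN-T and
    personalized-LM scores (the weight value is an assumption)."""
    return rnnt_logp + lm_weight * lm_logp

# Toy usage: rescoring candidate next tokens for one beam hypothesis.
candidates = {
    "jon": (-1.5, -0.4),   # (RNN-T log-prob, personalized-LM log-prob)
    "john": (-1.3, -2.1),
    "jaan": (-1.6, -0.2),  # e.g. a rare contact name boosted by the personal LM
}
best = max(candidates, key=lambda tok: shallow_fusion_score(*candidates[tok]))
print(best)  # token with the highest fused score: "jon"
```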
Related papers
- Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution uses Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) systems along with Deep Neural Network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- LongFNT: Long-form Speech Recognition with Factorized Neural Transducer [64.75547712366784]
We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor.
The effectiveness of our LongFNT approach is validated on LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.
arXiv Detail & Related papers (2022-11-17T08:48:27Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
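As a loose sketch of the factorization idea above (not the paper's exact architecture), the module below produces the blank score from a small joint network and the vocabulary scores from an LM-style head over the label predictor, which is what would later be adapted on text-only data; all dimensions and layer choices here are assumptions.

```python
import torch
import torch.nn as nn

class FactorizedTransducerOutput(nn.Module):
    """Illustrative factorized transducer output: blank and vocabulary scores
    come from separate branches (a sketch, not the paper's exact design)."""
    def __init__(self, enc_dim, pred_dim, vocab_size):
        super().__init__()
        # Blank branch: joint network over encoder and predictor states.
        self.blank_joint = nn.Sequential(
            nn.Linear(enc_dim + pred_dim, 256), nn.Tanh(), nn.Linear(256, 1)
        )
        # Vocabulary branch: acoustic projection plus an LM-style head on the
        # label predictor, so the predictor behaves like a standalone LM.
        self.enc_vocab_proj = nn.Linear(enc_dim, vocab_size)
        self.lm_head = nn.Linear(pred_dim, vocab_size)

    def forward(self, enc_state, pred_state):
        blank_logit = self.blank_joint(torch.cat([enc_state, pred_state], dim=-1))
        vocab_score = self.enc_vocab_proj(enc_state) + torch.log_softmax(
            self.lm_head(pred_state), dim=-1
        )
        return torch.cat([blank_logit, vocab_score], dim=-1)

# Toy usage with made-up dimensions.
out_layer = FactorizedTransducerOutput(enc_dim=512, pred_dim=512, vocab_size=1000)
scores = out_layer(torch.randn(2, 512), torch.randn(2, 512))
print(scores.shape)  # torch.Size([2, 1001])
```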
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves a word timing difference of less than 50 ms on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter [10.262490936452688]
This paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T.
By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets.
arXiv Detail & Related papers (2020-11-17T06:42:47Z)
- Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer [28.697119605752643]
Recurrent Neural Network Transducer (RNN-T) has an implicit neural network language model (NNLM) and cannot easily leverage unpaired text data during training.
Previous work has proposed various fusion methods to incorporate external NNLMs into end-to-end ASR to address this weakness.
We propose extensions to these techniques that allow RNN-T to exploit external NNLMs during both training and inference time.
arXiv Detail & Related papers (2020-10-26T20:10:12Z)
- Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability [46.73349163361723]
Recurrent neural network transducer (RNN-T) is a promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.
We describe our recent development of RNN-T models with reduced GPU memory consumption during training.
We study how to customize RNN-T models to a new domain, which is important for deploying E2E models to practical scenarios.
arXiv Detail & Related papers (2020-07-30T02:35:20Z)
- Learning Source Phrase Representations for Neural Machine Translation [65.94387047871648]
We propose an attentive phrase representation generation mechanism which is able to generate phrase representations from corresponding token representations.
In our experiments, we obtain significant improvements on the WMT 14 English-German and English-French tasks on top of the strong Transformer baseline.
arXiv Detail & Related papers (2020-06-25T13:43:11Z)
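The sketch below shows one generic way such a phrase representation could be pooled from its token representations with learned attention; the module name, scoring layer, and dimensions are made up for illustration and are not the paper's mechanism.

```python
import torch
import torch.nn as nn

class AttentivePhrasePooling(nn.Module):
    """Generic attentive pooling of token vectors into one phrase vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each token in the phrase

    def forward(self, token_reprs):  # token_reprs: (phrase_len, dim)
        weights = torch.softmax(self.score(token_reprs), dim=0)  # (phrase_len, 1)
        return (weights * token_reprs).sum(dim=0)                # (dim,)

# Toy usage: a 4-token source phrase with 512-dim token vectors (made up).
pool = AttentivePhrasePooling(dim=512)
phrase_vec = pool(torch.randn(4, 512))
print(phrase_vec.shape)  # torch.Size([512])
```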
- Contextual RNN-T For Open Domain ASR [41.83409885125617]
End-to-end (E2E) systems for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system into a single neural network.
This has some nice advantages, but it also limits the system to being trained on paired audio and text only.
Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names.
We propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words.
arXiv Detail & Related papers (2020-06-04T04:37:03Z)
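As a hedged sketch of how metadata text might be attended to for biasing (the fusion point, dimensions, and module names are assumptions, not the paper's design), the module below attends from a decoder state over embedded context phrases and fuses the attended summary back into that state.

```python
import torch
import torch.nn as nn

class ContextBiasingAttention(nn.Module):
    """Attention over embedded metadata phrases (e.g. contact names); the
    attended summary is fused with the decoder state. Illustrative only."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, dec_state, context_embs):
        # dec_state: (batch, 1, dim); context_embs: (batch, n_phrases, dim)
        ctx, _ = self.attn(dec_state, context_embs, context_embs)
        return self.fuse(torch.cat([dec_state, ctx], dim=-1))

# Toy usage: 2 utterances, 10 context phrases each, 256-dim states (made up).
biaser = ContextBiasingAttention(dim=256)
out = biaser(torch.randn(2, 1, 256), torch.randn(2, 10, 256))
print(out.shape)  # torch.Size([2, 1, 256])
```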
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore their latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
- Recognizing Long Grammatical Sequences Using Recurrent Networks Augmented With An External Differentiable Stack [73.48927855855219]
Recurrent neural networks (RNNs) are a widely used deep architecture for sequence modeling, generation, and prediction.
RNNs generalize poorly over very long sequences, which limits their applicability to many important temporal processing and time series forecasting problems.
One way to address these shortcomings is to couple an RNN with an external, differentiable memory structure, such as a stack.
In this paper, we improve the memory-augmented RNN with important architectural and state updating mechanisms.
arXiv Detail & Related papers (2020-04-04T14:19:15Z)
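A minimal sketch of the soft (differentiable) stack update described in the entry above; the fixed depth, the blending rule, and the separate push/pop/no-op weights are simplifying assumptions rather than the paper's exact mechanism (in practice the weights typically come from a softmax over controller outputs, as done here).

```python
import torch

def soft_stack_update(stack, push_w, pop_w, keep_w, new_item):
    """One step of a continuous stack: blend the push, pop, and no-op results.

    stack:    (depth, dim) current contents, row 0 is the top
    push_w, pop_w, keep_w: non-negative weights summing to 1
    new_item: (dim,) vector emitted by the RNN controller
    """
    pushed = torch.cat([new_item.unsqueeze(0), stack[:-1]], dim=0)       # shift down, put item on top
    popped = torch.cat([stack[1:], torch.zeros_like(stack[:1])], dim=0)  # shift up, pad the bottom
    return push_w * pushed + pop_w * popped + keep_w * stack

# Toy usage with made-up sizes: depth 8, element dimension 16.
stack = torch.zeros(8, 16)
weights = torch.softmax(torch.tensor([2.0, -1.0, 0.0]), dim=0)  # mostly "push"
stack = soft_stack_update(stack, weights[0], weights[1], weights[2], torch.randn(16))
print(stack[0])  # top of the stack is (mostly) the newly pushed vector
```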