Class LM and word mapping for contextual biasing in End-to-End ASR
- URL: http://arxiv.org/abs/2007.05609v3
- Date: Mon, 10 Aug 2020 14:06:41 GMT
- Title: Class LM and word mapping for contextual biasing in End-to-End ASR
- Authors: Rongqing Huang, Ossama Abdel-hamid, Xinwei Li, Gunnar Evermann
- Abstract summary: In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid interest in the speech recognition community.
In this paper, we propose an algorithm to train a context-aware E2E model and allow the beam search to traverse into the context FST during inference.
Although an E2E model does not need a pronunciation dictionary, it is worthwhile to make use of existing pronunciation knowledge to improve accuracy.
- Score: 4.989480853499918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid
interest in the speech recognition community. They convert speech input to text
units with a single trainable neural network model. In ASR, many utterances
contain rich named entities. Such named entities may be user- or location-specific,
and they are not seen during training. A single static model cannot easily exploit
dynamic contextual information during inference. In this paper, we propose to train
a context-aware E2E model and to allow the beam search to traverse into the context
FST during inference. We also propose a simple method to adjust the cost discrepancy
between the context FST and the base model. This algorithm reduces the WER on named
entity utterances by 57% with little accuracy degradation on regular utterances.
Although an E2E model does not need a pronunciation dictionary, it is worthwhile to
exploit existing pronunciation knowledge to improve accuracy. We therefore also
propose an algorithm that maps rare entity words to common words via pronunciation
and treats the mapped words as alternative forms of the original word during
recognition. This algorithm further reduces the WER on named entity utterances by
another 31%.
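To make the first idea concrete, the sketch below approximates traversal of a context FST during beam search with a token-level prefix trie over biasing phrases; the `BIAS_WEIGHT` constant stands in for the proposed cost adjustment between the context FST and the base model. This is a minimal sketch under stated assumptions, not the paper's implementation; all names and values are illustrative.

```python
# Illustrative sketch only: a prefix trie stands in for the context FST, and
# BIAS_WEIGHT stands in for the cost adjustment between the context FST and
# the base model. Names and values are assumptions, not the paper's code.
from dataclasses import dataclass, field
from typing import Optional

BIAS_WEIGHT = 2.0  # assumed tunable bonus per in-context token

def build_context_trie(phrases):
    """Build a token-level prefix trie over the biasing phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node["<eop>"] = {}  # end-of-phrase marker
    return root

@dataclass
class Hypothesis:
    tokens: list = field(default_factory=list)
    score: float = 0.0
    trie_node: Optional[dict] = None  # current position inside the trie

def extend(hyp, token, base_logprob, trie_root):
    """Extend a hypothesis by one token; add a bias bonus while it follows a
    biasing phrase. (Subtractive failure arcs are omitted for brevity.)"""
    node = hyp.trie_node if hyp.trie_node is not None else trie_root
    bonus, next_node = 0.0, None
    if token in node:                 # continue a phrase already in progress
        next_node, bonus = node[token], BIAS_WEIGHT
    elif token in trie_root:          # start a new biasing phrase
        next_node, bonus = trie_root[token], BIAS_WEIGHT
    return Hypothesis(hyp.tokens + [token],
                      hyp.score + base_logprob + bonus,
                      next_node)
```

The second idea, mapping rare entity words to common words via pronunciation, can be sketched as a nearest-neighbor lookup over a pronunciation lexicon followed by a rewrite of the decoded hypothesis. The lexicon, the plain phoneme edit distance, and the post-decoding rewrite are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative sketch only: each rare entity word is mapped to the common word
# whose pronunciation is closest under a plain phoneme edit distance, and the
# common word is rewritten back to the entity after decoding.

def phoneme_edit_distance(a, b):
    """Levenshtein distance over two phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (pa != pb))
    return dp[-1]

def map_rare_to_common(rare_words, common_words, lexicon):
    """lexicon maps a word to its phoneme sequence; returns rare -> common."""
    return {rare: min(common_words,
                      key=lambda w: phoneme_edit_distance(lexicon[rare], lexicon[w]))
            for rare in rare_words}

def rewrite_hypothesis(tokens, mapping):
    """Replace a recognized common word with the rare entity it stands in for."""
    inverse = {common: rare for rare, common in mapping.items()}
    return [inverse.get(t, t) for t in tokens]
```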
Related papers
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative, compared to no biasing and shallow fusion, respectively.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show 21.4% word error rate improvement compared to a baseline ASR system.
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
- Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation [0.742779257315787]
This paper proposes a novel retraining-free customized method for E2E-ASRs based on a named-entity-aware E2E-ASR model and phoneme similarity estimation.
Experimental results show that the proposed method improves the target NE character error rate by 35.7% on average relative to the conventional E2E-ASR model.
arXiv Detail & Related papers (2023-05-29T02:10:13Z)
- DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction [1.8322859214908722]
We present a highly-precise, PDA-compatible pronunciation learning framework for the task of TTS mispronunciation detection and correction.
We also propose a novel mispronunciation detection model called DTW-SiameseNet, which employs metric learning with a Siamese architecture for Dynamic Time Warping (DTW) with triplet loss.
Human evaluation shows our proposed approach improves pronunciation accuracy on average by 6% compared to strong phoneme-based and audio-based baselines.
arXiv Detail & Related papers (2023-03-01T01:53:11Z)
- Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model [0.0]
We release contextual biasing lists to accompany the Earnings21 dataset.
We show results for shallow fusion contextual biasing applied to two different decoding algorithms.
We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative.
arXiv Detail & Related papers (2022-09-02T19:30:16Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition [62.997667081978825]
We present an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
In this paper we demonstrate that through this mechanism our system is able to recognize more than 85% of newly added words that it previously failed to recognize.
arXiv Detail & Related papers (2021-07-05T21:08:34Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieved a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Improving Proper Noun Recognition in End-to-End ASR By Customization of the MWER Loss Criterion [33.043533068435366]
Proper nouns present a challenge for end-to-end (E2E) automatic speech recognition (ASR) systems.
Unlike conventional ASR models, E2E systems lack an explicit pronunciation model that can be specifically trained with proper noun pronunciations.
This paper builds on recent advances in minimum word error rate (MWER) training to develop two new loss criteria that specifically emphasize proper noun recognition.
arXiv Detail & Related papers (2020-05-19T21:10:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.