Class LM and word mapping for contextual biasing in End-to-End ASR
- URL: http://arxiv.org/abs/2007.05609v3
- Date: Mon, 10 Aug 2020 14:06:41 GMT
- Title: Class LM and word mapping for contextual biasing in End-to-End ASR
- Authors: Rongqing Huang, Ossama Abdel-hamid, Xinwei Li, Gunnar Evermann
- Abstract summary: In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid interest in the speech recognition community.
In this paper, we propose an algorithm to train a context-aware E2E model and allow the beam search to traverse into the context FST during inference.
Although an E2E model does not need a pronunciation dictionary, it is worthwhile to make use of existing pronunciation knowledge to improve accuracy.
- Score: 4.989480853499918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid
interest in the speech recognition community. They convert speech input to text
units with a single trainable neural network model. In ASR, many utterances
contain rich named entities. Such named entities may be user- or location-specific,
and they are not seen during training. A single static model cannot easily exploit
dynamic contextual information during inference. In this paper, we propose to train
a context-aware E2E model and to allow the beam search to traverse into the context
FST during inference. We also propose a simple method to adjust the cost discrepancy
between the context FST and the base model. This algorithm reduces the WER on named
entity utterances by 57% with little accuracy degradation on regular utterances.
Although an E2E model does not need a pronunciation dictionary, it is worthwhile to
exploit existing pronunciation knowledge to improve accuracy. We therefore also
propose an algorithm that maps rare entity words to common words via pronunciation
and treats the mapped words as alternative forms of the original word during
recognition. This algorithm further reduces the WER on named entity utterances by
another 31%.
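To make the first idea concrete, the sketch below approximates traversal of a context FST during beam search with a token-level prefix trie over biasing phrases; the `BIAS_WEIGHT` constant stands in for the proposed cost adjustment between the context FST and the base model. This is a minimal sketch under stated assumptions, not the paper's implementation; all names and values are illustrative.

```python
# Illustrative sketch only: a prefix trie stands in for the context FST, and
# BIAS_WEIGHT stands in for the cost adjustment between the context FST and
# the base model. Names and values are assumptions, not the paper's code.
from dataclasses import dataclass, field
from typing import Optional

BIAS_WEIGHT = 2.0  # assumed tunable bonus per in-context token

def build_context_trie(phrases):
    """Build a token-level prefix trie over the biasing phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node["<eop>"] = {}  # end-of-phrase marker
    return root

@dataclass
class Hypothesis:
    tokens: list = field(default_factory=list)
    score: float = 0.0
    trie_node: Optional[dict] = None  # current position inside the trie

def extend(hyp, token, base_logprob, trie_root):
    """Extend a hypothesis by one token; add a bias bonus while it follows a
    biasing phrase. (Subtractive failure arcs are omitted for brevity.)"""
    node = hyp.trie_node if hyp.trie_node is not None else trie_root
    bonus, next_node = 0.0, None
    if token in node:                 # continue a phrase already in progress
        next_node, bonus = node[token], BIAS_WEIGHT
    elif token in trie_root:          # start a new biasing phrase
        next_node, bonus = trie_root[token], BIAS_WEIGHT
    return Hypothesis(hyp.tokens + [token],
                      hyp.score + base_logprob + bonus,
                      next_node)
```

The second idea, mapping rare entity words to common words via pronunciation, can be sketched as a nearest-neighbor lookup over a pronunciation lexicon followed by a rewrite of the decoded hypothesis. The lexicon, the plain phoneme edit distance, and the post-decoding rewrite are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative sketch only: each rare entity word is mapped to the common word
# whose pronunciation is closest under a plain phoneme edit distance, and the
# common word is rewritten back to the entity after decoding.

def phoneme_edit_distance(a, b):
    """Levenshtein distance over two phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (pa != pb))
    return dp[-1]

def map_rare_to_common(rare_words, common_words, lexicon):
    """lexicon maps a word to its phoneme sequence; returns rare -> common."""
    return {rare: min(common_words,
                      key=lambda w: phoneme_edit_distance(lexicon[rare], lexicon[w]))
            for rare in rare_words}

def rewrite_hypothesis(tokens, mapping):
    """Replace a recognized common word with the rare entity it stands in for."""
    inverse = {common: rare for rare, common in mapping.items()}
    return [inverse.get(t, t) for t in tokens]
```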
Related papers
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative, compared to no biasing and shallow fusion, respectively.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show 21.4% word error rate improvement compared to a baseline ASR system.
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
- Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation [0.742779257315787]
This paper proposes a novel retraining-free customized method for E2E-ASRs based on a named-entity-aware E2E-ASR model and phoneme similarity estimation.
Experimental results show that the proposed method improves the target NE character error rate by 35.7% on average relative to the conventional E2E-ASR model.
arXiv Detail & Related papers (2023-05-29T02:10:13Z)
- DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction [1.8322859214908722]
We present a highly-precise, PDA-compatible pronunciation learning framework for the task of TTS mispronunciation detection and correction.
We also propose a novel mispronunciation detection model called DTW-SiameseNet, which employs metric learning with a Siamese architecture for Dynamic Time Warping (DTW) with triplet loss.
Human evaluation shows our proposed approach improves pronunciation accuracy on average by 6% compared to strong phoneme-based and audio-based baselines.
arXiv Detail & Related papers (2023-03-01T01:53:11Z)
- Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model [0.0]
We release contextual biasing lists to accompany the Earnings21 dataset.
We show results for shallow fusion contextual biasing applied to two different decoding algorithms.
We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative.
arXiv Detail & Related papers (2022-09-02T19:30:16Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition [62.997667081978825]
We present an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
In this paper we demonstrate that through this mechanism our system is able to recognize more than 85% of newly added words that it previously failed to recognize.
arXiv Detail & Related papers (2021-07-05T21:08:34Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieved a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Improving Proper Noun Recognition in End-to-End ASR By Customization of the MWER Loss Criterion [33.043533068435366]
Proper nouns present a challenge for end-to-end (E2E) automatic speech recognition (ASR) systems.
Unlike conventional ASR models, E2E systems lack an explicit pronunciation model that can be specifically trained with proper noun pronunciations.
This paper builds on recent advances in minimum word error rate (MWER) training to develop two new loss criteria that specifically emphasize proper noun recognition.
arXiv Detail & Related papers (2020-05-19T21:10:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.