Improving Proper Noun Recognition in End-to-End ASR By Customization of
the MWER Loss Criterion
- URL: http://arxiv.org/abs/2005.09756v1
- Date: Tue, 19 May 2020 21:10:50 GMT
- Authors: Cal Peyser, Tara N. Sainath, Golan Pundak
- Abstract summary: Proper nouns present a challenge for end-to-end (E2E) automatic speech recognition (ASR) systems.
Unlike conventional ASR models, E2E systems lack an explicit pronunciation model that can be specifically trained with proper noun pronunciations.
This paper builds on recent advances in minimum word error rate (MWER) training to develop two new loss criteria that specifically emphasize proper noun recognition.
- Score: 33.043533068435366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proper nouns present a challenge for end-to-end (E2E) automatic speech
recognition (ASR) systems in that a particular name may appear only rarely
during training, and may have a pronunciation similar to that of a more common
word. Unlike conventional ASR models, E2E systems lack an explicit
pronunciation model that can be specifically trained with proper noun
pronunciations and a language model that can be trained on a large text-only
corpus. Past work has addressed this issue by incorporating additional training
data or additional models. In this paper, we instead build on recent advances
in minimum word error rate (MWER) training to develop two new loss criteria
that specifically emphasize proper noun recognition. Unlike past work on this
problem, this method requires no new data during training or external models
during inference. We see improvements ranging from 2% to 7% relative on several
relevant benchmarks.
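The MWER criterion the paper builds on can be illustrated with a small numerical sketch. The proper-noun weighting below (an extra `alpha` penalty on reference proper nouns missing from a hypothesis) is a hypothetical simplification for illustration only, not the paper's exact loss formulation:

```python
# Illustrative MWER-style loss over an n-best list, with a hypothetical
# extra penalty (alpha) on proper-noun errors. All names and the weighting
# scheme are assumptions made for this sketch.
import numpy as np

def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    dp = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (h != r))
    return dp[len(ref)]

def weighted_errors(hyp, ref, proper_nouns, alpha=2.0):
    """Ordinary word errors plus an extra (alpha - 1) penalty for every
    reference proper noun the hypothesis fails to produce."""
    missed = sum(1 for w in ref if w in proper_nouns and w not in hyp)
    return word_edit_distance(hyp, ref) + (alpha - 1.0) * missed

def mwer_loss(nbest_log_probs, nbest_hyps, ref, proper_nouns, alpha=2.0):
    """Expected weighted errors under the renormalized n-best distribution,
    with the mean error subtracted as the usual variance-reducing baseline."""
    p = np.exp(nbest_log_probs - np.max(nbest_log_probs))
    p = p / p.sum()
    errs = np.array([weighted_errors(h, ref, proper_nouns, alpha)
                     for h in nbest_hyps])
    return float(np.dot(p, errs - errs.mean()))

# A hypothesis that drops the proper noun "tara" is penalized twice as hard
# (alpha=2) as an ordinary one-word error would be.
ref = "call tara now".split()
hyps = [["call", "tara", "now"], ["call", "terra", "now"]]
loss = mwer_loss(np.log([0.8, 0.2]), hyps, ref, proper_nouns={"tara"})
```

Because the baseline is subtracted, hypotheses with fewer weighted errors than average receive negative loss contributions, so gradient descent shifts probability mass toward them.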
Related papers
- An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition [10.234673954430221]
We study how altering the context list to contain words with different frequency distributions affects model performance.
A series of experiments conducted on the AISHELL-1 benchmark dataset suggests that using all vocabulary words from the training corpus as the context list and pairing them with our balanced objective yields the best performance.
arXiv Detail & Related papers (2024-09-10T12:52:36Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning [20.643270151774182]
We seek to continually learn from on-device user corrections through Federated Learning (FL).
We explore techniques to target fresh terms that the model has not previously encountered, learn long-tail words, and mitigate catastrophic forgetting.
In experimental evaluations, we find that the proposed techniques improve model recognition of fresh terms, while preserving quality on the overall language distribution.
arXiv Detail & Related papers (2023-09-29T21:04:10Z)
- Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation [0.742779257315787]
This paper proposes a novel retraining-free customized method for E2E-ASRs based on a named-entity-aware E2E-ASR model and phoneme similarity estimation.
Experimental results show that the proposed method improves the target NE character error rate by 35.7% on average relative to the conventional E2E-ASR model.
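A minimal sketch of the retraining-free idea, assuming a phoneme lexicon for registered named entities and using `difflib` sequence matching as a stand-in for the paper's phoneme similarity estimator; the lexicon, threshold, and substitution procedure are illustrative assumptions:

```python
# Hedged sketch: retraining-free entity substitution via phoneme similarity.
# difflib's ratio is only a proxy for the paper's similarity estimation;
# the lexicon, threshold, and procedure are assumptions for illustration.
from difflib import SequenceMatcher

def phoneme_similarity(phones_a, phones_b):
    """Similarity in [0, 1] between two phoneme sequences."""
    return SequenceMatcher(None, phones_a, phones_b).ratio()

def substitute_entities(hyp_words, hyp_phones, entity_lexicon, threshold=0.8):
    """Replace a hypothesis word with a registered named entity whenever
    their phoneme sequences are sufficiently similar."""
    out = []
    for word, phones in zip(hyp_words, hyp_phones):
        best_word, best_sim = word, threshold
        for entity, entity_phones in entity_lexicon.items():
            sim = phoneme_similarity(phones, entity_phones)
            if sim > best_sim:
                best_word, best_sim = entity, sim
        out.append(best_word)
    return out

# "kohl" and the registered entity "Cole" share the phonemes K OW L,
# so the enharmonic spelling is corrected without retraining the model.
fixed = substitute_entities(
    ["call", "kohl"],
    [["K", "AO", "L"], ["K", "OW", "L"]],
    {"Cole": ["K", "OW", "L"]},
)
```

Because only the lexicon changes, new named entities can be registered at deployment time with no gradient updates to the E2E model.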
arXiv Detail & Related papers (2023-05-29T02:10:13Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory so that these words and phrases are recognized correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition [62.997667081978825]
We present an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
In this paper we demonstrate that through this mechanism our system is able to recognize more than 85% of newly added words that it previously failed to recognize.
arXiv Detail & Related papers (2021-07-05T21:08:34Z)
- Learning Word-Level Confidence For Subword End-to-End ASR [48.09713798451474]
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR).
The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
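The model-selection idea can be sketched as a simple confidence gate; the confidence floor and the server fallback API below are hypothetical stand-ins for the paper's actual setup:

```python
# Illustrative confidence-gated model selection: keep the on-device E2E
# transcript unless some word's confidence falls below a floor, in which
# case defer to a server-side hybrid recognizer. The `server_recognize`
# callable and the 0.5 floor are assumptions made for this sketch.

def select_transcript(device_words, word_confidences, server_recognize,
                      audio, floor=0.5):
    """Return the on-device hypothesis, or the server hypothesis if any
    word-level confidence is below `floor` (rare words tend to score low)."""
    if any(c < floor for c in word_confidences):
        return server_recognize(audio)
    return list(device_words)

# The rare name is misrecognized with low confidence on-device, so the
# audio is re-decoded by the (mocked) server model.
server = lambda audio: ["call", "cal", "peyser"]
result = select_transcript(["call", "cal", "pacer"], [0.9, 0.8, 0.3],
                           server, audio=b"...")
```

Routing on word-level rather than utterance-level confidence lets a single low-confidence rare word trigger the fallback even when the rest of the utterance is easy.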
arXiv Detail & Related papers (2021-03-11T15:03:33Z)
- Class LM and word mapping for contextual biasing in End-to-End ASR [4.989480853499918]
In recent years, all-neural, end-to-end (E2E) ASR systems gained rapid interest in the speech recognition community.
In this paper, we propose an algorithm to train a context-aware E2E model and allow the beam search to traverse into the context FST during inference.
Although an E2E model does not need a pronunciation dictionary, it is interesting to make use of existing pronunciation knowledge to improve accuracy.
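A crude shallow-fusion-style stand-in for the FST traversal described above: the real system composes a class-based LM / context FST with the decoder, while the flat per-phrase bonus here is a hypothetical simplification for illustration:

```python
# Illustrative contextual biasing: boost a beam-search hypothesis score for
# each context phrase it contains. The real approach traverses a context
# FST during decoding; this flat log-score bonus is only a sketch.

def contextual_bonus(hyp_words, context_phrases, bonus=2.0):
    """`bonus` log-score units for each context phrase that appears as a
    contiguous word sequence in the hypothesis."""
    n = len(hyp_words)
    total = 0.0
    for phrase in context_phrases:
        p = phrase.split()
        if any(hyp_words[i:i + len(p)] == p for i in range(n - len(p) + 1)):
            total += bonus
    return total

def rescore(hyp_score, hyp_words, context_phrases, weight=1.0):
    """Shallow-fusion-style combination of the E2E score and the bias."""
    return hyp_score + weight * contextual_bonus(hyp_words, context_phrases)
```

A hypothesis containing a biasing phrase such as "tara sainath" thus outranks an acoustically similar competitor that misses it, which is the effect contextual biasing is after.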
arXiv Detail & Related papers (2020-07-10T20:58:44Z)
- Contextual RNN-T For Open Domain ASR [41.83409885125617]
End-to-end (E2E) systems for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system into a single neural network.
While this has some nice advantages, it also limits the system to being trained using only paired audio and text.
Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names.
We propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words.
arXiv Detail & Related papers (2020-06-04T04:37:03Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.