Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation
- URL: http://arxiv.org/abs/2305.17846v1
- Date: Mon, 29 May 2023 02:10:13 GMT
- Title: Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation
- Authors: Yui Sudo, Kazuya Hata, Kazuhiro Nakadai
- Abstract summary: This paper proposes a novel retraining-free customization method for E2E-ASR based on a named-entity-aware E2E-ASR model and phoneme similarity estimation.
Experimental results show that the proposed method improves the target NE character error rate by 35.7% on average relative to a conventional E2E-ASR model.
- Score: 0.742779257315787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end automatic speech recognition (E2E-ASR) has the potential to
improve recognition performance, but one issue it still needs to address is
the difficulty of handling enharmonic words: named entities (NEs) with the
same pronunciation and part of speech that are spelled differently. This
often occurs with Japanese personal names that share a pronunciation but are
written with different Kanji characters. Since such NE words tend to be
important keywords, an ASR system quickly loses user trust if it
misrecognizes them. To solve this problem, this paper proposes a novel
retraining-free customization method for E2E-ASR based on a
named-entity-aware E2E-ASR model and phoneme similarity estimation.
Experimental results show that the proposed method improves the target NE
character error rate by 35.7% on average, relative to a conventional E2E-ASR
model, when personal names are selected as the target NE.
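The abstract describes the method only at a high level; as a rough illustration, here is a minimal Python sketch of the customization step, assuming a named-entity-aware model that tags NE spans as <ne>...</ne>, a toy phoneme lexicon, and invented function names (none of this is the authors' implementation):

```python
# Hypothetical sketch of retraining-free NE customization (not the authors' code).
# A named-entity-aware E2E-ASR model is assumed to tag NE spans as <ne>...</ne>;
# each tagged span is replaced by the phonemically closest user-registered NE.
import re

def phoneme_edit_distance(a: list[str], b: list[str]) -> int:
    """Plain Levenshtein distance over phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (pa != pb))  # substitution
    return dp[-1]

# Toy grapheme-to-phoneme table (an assumption; a real system would use G2P).
LEXICON = {"Kato": ["k", "a", "t", "o"], "Kado": ["k", "a", "d", "o"]}

def customize(hypothesis: str, user_nes: list[str]) -> str:
    """Replace each tagged NE span with the closest user-registered NE."""
    def replace(match: re.Match) -> str:
        span = match.group(1)
        # Fall back to characters when a word is missing from the toy lexicon.
        src = LEXICON.get(span, list(span.lower()))
        return min(user_nes,
                   key=lambda ne: phoneme_edit_distance(
                       src, LEXICON.get(ne, list(ne.lower()))))
    return re.sub(r"<ne>(.*?)</ne>", replace, hypothesis)

print(customize("I met <ne>Kato</ne> yesterday", ["Kado"]))  # -> "I met Kado yesterday"
```

Because only the user-supplied NE list and the similarity matching change, new names can be added without retraining the E2E model, which is the point of the approach.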
Related papers
- DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition [10.844822448167935]
We propose a Description Augmented Named entity CorrEctoR (dubbed DANCER) to mitigate phonetic confusion in end-to-end automatic speech recognition (ASR) transcriptions.
DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), with a relative character error rate (CER) reduction of about 7% on AISHELL-1 for named entities.
More notably, when tested on a homophone set containing named entities of high phonetic confusion, DANCER offers a more pronounced relative CER reduction of 46% over PED-NEC for named entities.
arXiv Detail & Related papers (2024-03-26T12:27:32Z)
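As a rough sketch of DANCER's description-augmentation idea, the toy Python below re-ranks phonetically close entity candidates by overlap between an entity description and the surrounding transcript; the entity inventory, similarity measures, and weight are all invented for illustration:

```python
# Illustrative sketch of description-augmented NE correction (hypothetical
# entities and toy similarity measures; not the DANCER implementation).
from difflib import SequenceMatcher

# Assumed entity inventory: surface form -> short textual description.
ENTITIES = {
    "Jingan Temple": "buddhist temple and metro station in shanghai",
    "Jing'an District": "central district of shanghai with offices and malls",
}

def phonetic_sim(a: str, b: str) -> float:
    # Character-level proxy for phonetic similarity (a real system would
    # compare phone sequences produced by a G2P front end).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def description_sim(description: str, context: str) -> float:
    # Crude bag-of-words overlap between entity description and transcript.
    d, c = set(description.lower().split()), set(context.lower().split())
    return len(d & c) / max(len(d), 1)

def correct(span: str, context: str, alpha: float = 0.7) -> str:
    # Interpolate phonetic and semantic evidence; alpha is an assumed weight.
    return max(ENTITIES, key=lambda e: alpha * phonetic_sim(span, e)
               + (1 - alpha) * description_sim(ENTITIES[e], context))

print(correct("Jingan District", "the meeting is at an office in the district"))
# -> "Jing'an District"
```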
- Wiki-En-ASR-Adapt: Large-scale synthetic dataset for English ASR Customization [66.22007368434633]
We present the first large-scale public synthetic dataset for contextual spellchecking customization of automatic speech recognition (ASR).
The proposed approach allows creating millions of realistic examples of corrupted ASR hypotheses and simulating non-trivial biasing lists for the customization task.
We report experiments with training an open-source customization model on the proposed dataset and show that injecting hard negative biasing phrases decreases both WER and the number of false alarms.
arXiv Detail & Related papers (2023-09-29T14:18:59Z)
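A hypothetical miniature of the Wiki-En-ASR-Adapt data-generation idea: corrupt a reference phrase with sound-alike substitutions and pick phonetically close distractors as hard negatives for the biasing list (the confusion table and scoring below are assumptions, not the dataset's pipeline):

```python
# Hypothetical sketch of corruption and hard-negative mining (invented
# confusion pairs and scoring; not the dataset's actual generation pipeline).
import random
from difflib import SequenceMatcher

# Assumed sound-alike substitutions used to corrupt reference phrases.
CONFUSIONS = {"ph": "f", "c": "k", "ie": "ee", "a": "e"}

def corrupt(phrase: str, rng: random.Random) -> str:
    """Simulate an ASR error with one random sound-alike substitution."""
    options = [(s, t) for s, t in CONFUSIONS.items() if s in phrase]
    if not options:
        return phrase
    s, t = rng.choice(options)
    return phrase.replace(s, t, 1)

def hard_negatives(target: str, vocabulary: list[str], k: int = 3) -> list[str]:
    """Pick the k phrases most similar to the target as distractors; character
    similarity stands in for phonetic closeness here."""
    others = [w for w in vocabulary if w != target]
    return sorted(others, key=lambda w: -SequenceMatcher(None, w, target).ratio())[:k]

rng = random.Random(0)
print(corrupt("stephanie", rng))  # one randomly chosen sound-alike corruption
print(hard_negatives("stephanie", ["stephany", "stefan", "mark", "steven"]))
```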
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion for improving automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show a 21.4% word error rate improvement compared to a baseline ASR system.
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
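SpellMapper's retrieval step can be pictured with a plain character n-gram index; note that the paper learns mappings between misspelled and correct n-grams, whereas this toy uses exact n-gram overlap purely for illustration:

```python
# Rough sketch of candidate retrieval via n-gram indexing (a simplification;
# not the SpellMapper implementation).
from collections import defaultdict

def ngrams(s: str, n: int = 3) -> set[str]:
    s = f"#{s}#"  # pad so prefixes and suffixes form n-grams too
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(custom_phrases: list[str]) -> dict[str, set[str]]:
    """Map each character n-gram to the custom phrases containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for phrase in custom_phrases:
        for g in ngrams(phrase):
            index[g].add(phrase)
    return index

def retrieve(fragment: str, index: dict[str, set[str]], k: int = 2) -> list[str]:
    """Score candidates by how many n-grams they share with the ASR fragment."""
    scores: dict[str, int] = defaultdict(int)
    for g in ngrams(fragment):
        for phrase in index.get(g, ()):
            scores[phrase] += 1
    return sorted(scores, key=scores.get, reverse=True)[:k]

index = build_index(["tchaikovsky", "rachmaninoff", "shostakovich"])
print(retrieve("chaikovski", index))  # -> ['tchaikovsky', 'shostakovich']
```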
- JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition [63.38229762589485]
We propose a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into the ILM.
With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
arXiv Detail & Related papers (2023-02-16T21:07:38Z)
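Schematically, JEIT combines an E2E loss on paired data with an internal-LM next-token loss on unpaired text; the toy code below only illustrates that shape (the weight and dummy losses are assumptions, not the paper's training recipe):

```python
# Schematic sketch of the joint objective (a toy stand-in, not JEIT itself).
class DummyModel:
    def e2e_nll(self, paired_batch) -> float:
        # Stand-in for the E2E loss, which needs (audio, transcript) pairs.
        return 2.3
    def ilm_nll(self, text_batch) -> float:
        # Stand-in for the internal LM's next-token loss on unpaired text.
        return 4.1

def jeit_step(model, paired_batch, text_batch, beta: float = 0.5) -> float:
    """L = L_e2e(paired) + beta * L_ilm(unpaired): text-only batches update
    the internal LM (the decoder) without requiring matching audio."""
    return model.e2e_nll(paired_batch) + beta * model.ilm_nll(text_batch)

print(jeit_step(DummyModel(), paired_batch=None, text_batch=None))  # -> 4.35
```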
- Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems [2.4909170697740963]
We propose a contextual density ratio approach for both training a context-aware E2E model and adapting the language model to named entities.
Our proposed technique achieves a relative improvement of up to 46.5% on names over an E2E baseline without degrading overall recognition accuracy on the whole test set.
arXiv Detail & Related papers (2022-06-29T13:12:46Z)
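The density-ratio idea can be illustrated as n-best rescoring: subtract a source-domain LM score and add an NE-adapted LM score. The single weight and the toy log probabilities below are assumptions, not the paper's exact formulation:

```python
# Hedged sketch of density-ratio biasing at rescoring time (placeholder
# weights and scores; not the paper's exact method).
def density_ratio_score(log_p_e2e: float, log_p_context_lm: float,
                        log_p_source_lm: float, lam: float = 0.3) -> float:
    """Biased hypothesis score: the source-domain LM implicitly learned by
    the E2E model is 'subtracted' and the NE-adapted LM is 'added'."""
    return log_p_e2e + lam * (log_p_context_lm - log_p_source_lm)

# Toy n-best list: (hypothesis, log p_e2e, log p_context_lm, log p_source_lm)
nbest = [
    ("call ann marie", -4.1, -6.0, -3.5),
    ("call anne-marie", -4.4, -2.5, -5.0),  # NE-adapted LM prefers this name
]
best = max(nbest, key=lambda h: density_ratio_score(h[1], h[2], h[3]))
print(best[0])  # -> "call anne-marie"
```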
- End-to-end contextual ASR based on posterior distribution adaptation for hybrid CTC/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a drawback for contextual ASR: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to an attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
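A minimal sketch of bias attention, assuming toy dimensions and random phrase embeddings (not the paper's CBA module): the decoder state attends over biasing-phrase embeddings and the resulting context vector is added back to the state before prediction:

```python
# Toy contextual bias attention step (invented dimensions and embeddings).
import math
import random

random.seed(0)
DIM = 4
phrases = ["no-bias", "kazuya hata", "shinjuku gyoen"]  # "no-bias" slot assumed
embed = {p: [random.gauss(0, 1) for _ in range(DIM)] for p in phrases}

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def bias_attention(decoder_state: list[float], phrase_list: list[str]):
    """Attend over phrase embeddings; return state + weighted context vector."""
    scores = [sum(s * e for s, e in zip(decoder_state, embed[p]))
              for p in phrase_list]
    weights = softmax(scores)
    context = [sum(w * embed[p][d] for w, p in zip(weights, phrase_list))
               for d in range(DIM)]
    return [s + c for s, c in zip(decoder_state, context)], weights

state = [0.5, -0.2, 0.1, 0.8]
new_state, w = bias_attention(state, phrases)
print([round(x, 2) for x in w])  # attention weights over the biasing phrases
```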
- Learning Word-Level Confidence For Subword End-to-End ASR [48.09713798451474]
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR).
The proposed confidence module also enables a model-selection approach that combines an on-device E2E model with a hybrid model on the server to address the rare-word recognition problem of the E2E model.
arXiv Detail & Related papers (2021-03-11T15:03:33Z)
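The interface can be illustrated with a simple heuristic aggregation of word-piece confidences (the paper instead learns the confidence module; the minimum rule below is only a stand-in):

```python
# Simple sketch of turning subword confidences into word-level confidence.
def word_confidences(subwords: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Merge word-piece scores: the sentencepiece marker '▁' starts a word."""
    words: list[tuple[str, float]] = []
    for piece, conf in subwords:
        if piece.startswith("▁") or not words:
            words.append((piece.lstrip("▁"), conf))
        else:
            w, c = words[-1]
            # Heuristic: a word is as weak as its weakest piece.
            words[-1] = (w + piece, min(c, conf))
    return words

pieces = [("▁ka", 0.93), ("to", 0.41), ("▁said", 0.98)]
print(word_confidences(pieces))  # -> [('kato', 0.41), ('said', 0.98)]
```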
- Class LM and word mapping for contextual biasing in End-to-End ASR [4.989480853499918]
In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid interest in the speech recognition community.
In this paper, we propose an algorithm to train a context-aware E2E model and to allow the beam search to traverse into the context FST during inference.
Although an E2E model does not need a pronunciation dictionary, it is worthwhile to make use of existing pronunciation knowledge to improve accuracy.
arXiv Detail & Related papers (2020-07-10T20:58:44Z)
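As a stand-in for the context FST, the toy below scores hypotheses against a prefix trie of biasing phrases during beam search; the bonus value and trie layout are assumptions for illustration:

```python
# Toy context-trie biasing (a stand-in for the paper's context FST).
def build_trie(phrases: list[list[str]]) -> dict:
    trie: dict = {}
    for phrase in phrases:
        node = trie
        for word in phrase:
            node = node.setdefault(word, {})
        node["<end>"] = {}  # marks a complete biasing phrase
    return trie

def context_bonus(hyp_words: list[str], trie: dict, bonus: float = 1.5) -> float:
    """Per-word bonus while the hypothesis suffix follows a context phrase."""
    total = 0.0
    for start in range(len(hyp_words)):
        node, matched = trie, 0
        for word in hyp_words[start:]:
            if word not in node:
                break
            node, matched = node[word], matched + 1
        total = max(total, matched * bonus)
    return total

trie = build_trie([["san", "mateo"], ["santa", "clara"]])
print(context_bonus(["drive", "to", "san", "mateo"], trie))  # -> 3.0
```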
- Contextual RNN-T For Open Domain ASR [41.83409885125617]
End-to-end (E2E) systems for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system into a single neural network.
While this has some nice advantages, it limits the system to being trained using only paired audio and text.
Because of this, E2E models tend to have difficulty correctly recognizing rare words that are not frequently seen during training, such as entity names.
We propose modifications to the RNN-T model that allow it to utilize additional metadata text, with the objective of improving performance on these named-entity words.
arXiv Detail & Related papers (2020-06-04T04:37:03Z)
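One simple way to picture "utilizing metadata text" is pooling metadata-token embeddings into a context vector fed alongside the usual features; the fusion below is an invented illustration, not the paper's actual RNN-T modification:

```python
# Toy sketch of conditioning a recognizer on metadata text (the pooling and
# concatenation are illustrative assumptions only).
import random

random.seed(1)
DIM = 4
VOCAB = ["playlist", "song", "adele", "hello"]
EMB = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in VOCAB}

def metadata_vector(metadata_tokens: list[str]) -> list[float]:
    """Mean-pool embeddings of metadata tokens into one context vector."""
    vecs = [EMB[t] for t in metadata_tokens if t in EMB]
    if not vecs:
        return [0.0] * DIM
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(DIM)]

def joiner_input(acoustic: list[float], label: list[float],
                 metadata_tokens: list[str]) -> list[float]:
    """Concatenate the pooled metadata vector onto the usual joiner features."""
    return acoustic + label + metadata_vector(metadata_tokens)

print(len(joiner_input([0.1] * DIM, [0.2] * DIM, ["adele", "hello"])))  # -> 12
```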
- End-to-end Named Entity Recognition from English Speech [51.22888702264816]
We introduce the first publicly available NER-annotated dataset for English speech and present an E2E approach that jointly optimizes the ASR and NER tagger components.
We also discuss how NER from speech can be used to handle out-of-vocabulary (OOV) words in an ASR system.
arXiv Detail & Related papers (2020-05-22T13:39:14Z)
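A joint ASR+NER model emits entities inline with the transcript; the bracketed tag format below is an assumed example of such an output, together with a parser for it:

```python
# Sketch of an inline-tagged joint ASR+NER output (the tagging scheme is an
# assumption for illustration, not the paper's exact format).
import re

def extract_entities(tagged: str) -> list[tuple[str, str]]:
    """Pull (type, entity) pairs out of an inline-tagged E2E transcript."""
    return re.findall(r"\[(PER|LOC|ORG)\s+([^\]]+)\]", tagged)

hyp = "[PER john smith] works at [ORG acme corp] in [LOC boston]"
print(extract_entities(hyp))
# -> [('PER', 'john smith'), ('ORG', 'acme corp'), ('LOC', 'boston')]
```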
- Improving Proper Noun Recognition in End-to-End ASR By Customization of the MWER Loss Criterion [33.043533068435366]
Proper nouns present a challenge for end-to-end (E2E) automatic speech recognition (ASR) systems.
Unlike conventional ASR models, E2E systems lack an explicit pronunciation model that can be specifically trained with proper-noun pronunciations.
This paper builds on recent advances in minimum word error rate (MWER) training to develop two new loss criteria that specifically emphasize proper noun recognition.
arXiv Detail & Related papers (2020-05-19T21:10:50Z)
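A toy version of emphasizing proper nouns in an expected-error criterion: weight name errors more heavily when averaging over an n-best list (illustrative only; the paper integrates the emphasis into MWER training rather than post-hoc scoring):

```python
# Toy proper-noun-weighted expected word error (not the paper's loss criteria).
import math

def weighted_errors(hyp: list[str], ref: list[str], proper: set[str],
                    w_proper: float = 3.0) -> float:
    """Count per-position mismatches, weighting proper-noun errors more (a
    real MWER setup would use an alignment, not positional comparison)."""
    errs = 0.0
    for h, r in zip(hyp, ref):
        if h != r:
            errs += w_proper if r in proper else 1.0
    return errs + abs(len(hyp) - len(ref))

def expected_wer(nbest: list[tuple[list[str], float]], ref: list[str],
                 proper: set[str]) -> float:
    """Posterior-weighted average of the weighted error over the n-best list."""
    zs = [math.exp(s) for _, s in nbest]
    total = sum(zs)
    return sum(z / total * weighted_errors(h, ref, proper)
               for (h, _), z in zip(nbest, zs))

ref = ["call", "saoirse"]
nbest = [(["call", "sersha"], -0.2), (["call", "saoirse"], -0.9)]
print(round(expected_wer(nbest, ref, proper={"saoirse"}), 3))  # -> 2.005
```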