Improving Contextual Recognition of Rare Words with an Alternate
Spelling Prediction Model
- URL: http://arxiv.org/abs/2209.01250v1
- Date: Fri, 2 Sep 2022 19:30:16 GMT
- Title: Improving Contextual Recognition of Rare Words with an Alternate
Spelling Prediction Model
- Authors: Jennifer Drexler Fox, Natalie Delworth
- Abstract summary: We release contextual biasing lists to accompany the Earnings21 dataset.
We show results for shallow fusion contextual biasing applied to two different decoding algorithms.
We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Contextual ASR, which takes a list of bias terms as input along with audio,
has drawn recent interest as ASR use becomes more widespread. We are releasing
contextual biasing lists to accompany the Earnings21 dataset, creating a public
benchmark for this task. We present baseline results on this benchmark using a
pretrained end-to-end ASR model from the WeNet toolkit. We show results for
shallow fusion contextual biasing applied to two different decoding algorithms.
Our baseline results confirm observations that end-to-end models struggle in
particular with words that are rarely or never seen during training, and that
existing shallow fusion techniques do not adequately address this problem. We
propose an alternate spelling prediction model that improves recall of rare
words by 34.7% relative and of out-of-vocabulary words by 97.2% relative,
compared to contextual biasing without alternate spellings. This model is
conceptually similar to ones used in prior work, but is simpler to implement as
it does not rely on either a pronunciation dictionary or an existing
text-to-speech system.
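To make the idea concrete, here is a minimal Python sketch of shallow fusion biasing with alternate spellings; it is not the authors' implementation, and `predict_alternates` plus the bonus value are hypothetical stand-ins for the paper's trained prediction model and tuned fusion weight.
```python
# Minimal sketch of shallow-fusion contextual biasing with alternate
# spellings, in the spirit of the paper (not the authors' code).

BIAS_BONUS = 2.0  # per-word log-domain reward; a tunable hyperparameter

def predict_alternates(term):
    """Hypothetical stand-in: return plausible spelling variants of a term."""
    return {term, term.lower(), term.replace("-", " ")}

def biased_score(base_logprob, hypothesis, bias_terms):
    """Rescore one beam-search hypothesis with a contextual bias bonus."""
    bonus = 0.0
    for term in bias_terms:
        # Reward the hypothesis if it contains any spelling variant of the
        # term, so rare words count even in non-canonical surface forms.
        if any(alt in hypothesis for alt in predict_alternates(term)):
            bonus += BIAS_BONUS * len(term.split())
    return base_logprob + bonus

# Usage: pick the best candidate from an n-best list after biasing.
nbest = [("we spoke with the ceo today", -12.9),
         ("we spoke with the see oh today", -12.3)]
best = max(nbest, key=lambda h: biased_score(h[1], h[0], ["CEO"]))
```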
Related papers
- LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR [3.841280537264271]
We propose a lightweight on-the-fly method to improve automatic speech recognition performance.
We combine a bias list of named entities with a word-level n-gram language model in a shallow fusion approach based on the Aho-Corasick string matching algorithm.
We achieve up to 21.6% relative improvement in the general word error rate with no practical difference in the inverse real-time factor.
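For illustration, a minimal sketch of the entity-matching step, assuming the third-party pyahocorasick package; the n-gram LM combination and fusion weights from the paper are omitted.
```python
# Sketch: locating bias-list entities in a partial hypothesis with
# Aho-Corasick (pip install pyahocorasick). Matches can then receive a
# shallow-fusion reward during decoding.
import ahocorasick

def build_matcher(named_entities):
    A = ahocorasick.Automaton()
    for phrase in named_entities:
        A.add_word(phrase.lower(), phrase)  # value = canonical spelling
    A.make_automaton()                      # compile failure links
    return A

matcher = build_matcher(["new york", "aho-corasick", "wenet"])
hypothesis = "decoding with wenet in new york"
for end_idx, phrase in matcher.iter(hypothesis):
    print(f"matched '{phrase}' ending at char {end_idx}")
```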
arXiv Detail & Related papers (2024-09-20T13:53:37Z)
- An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition [10.234673954430221]
We study the impact on model performance of altering the context list to contain words with different frequency distributions.
A series of experiments conducted on the AISHELL-1 benchmark dataset suggests that using all vocabulary words from the training corpus as the context list and pairing them with our balanced objective yields the best performance.
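A small sketch of how such context lists might be assembled; the rarity cutoff below is a hypothetical parameter, not the paper's.
```python
# Sketch: assembling context lists with different frequency profiles,
# as compared in the paper.
from collections import Counter

def context_list(train_transcripts, mode="all", rare_cutoff=5):
    counts = Counter(w for t in train_transcripts for w in t.split())
    if mode == "all":       # the setting the paper found best on AISHELL-1
        return sorted(counts)
    if mode == "rare":      # long-tail words only
        return sorted(w for w, c in counts.items() if c <= rare_cutoff)
    if mode == "frequent":
        return sorted(w for w, c in counts.items() if c > rare_cutoff)
    raise ValueError(f"unknown mode: {mode}")
```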
arXiv Detail & Related papers (2024-09-10T12:52:36Z)
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative compared to no biasing and shallow fusion, respectively.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
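As a rough illustration of the text-perturbation idea only (not the authors' recipe):
```python
# Sketch: randomly distort rare-word spellings in training references so the
# biasing component cannot overfit to exact surface forms.
import random

def perturb(word, rate=0.1):
    """Randomly delete or duplicate characters of a word."""
    out = []
    for ch in word:
        r = random.random()
        if r < rate:
            continue              # simulate a deletion error
        out.append(ch)
        if r > 1.0 - rate:
            out.append(ch)        # simulate an insertion error
    return "".join(out) or word   # never return an empty string
```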
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
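One way to picture the self-supervised loop is an external word memory whose bias strength grows with repeated observations; the class below is illustrative, not the paper's memory-enhanced model.
```python
# Sketch: a growing memory of newly observed words, consulted at decode time.
class WordMemory:
    def __init__(self, cap=5):
        self.counts = {}          # word -> times confirmed in past outputs
        self.cap = cap

    def observe(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

    def bias_weight(self, word, base=1.0):
        # Self-supervised signal: more frequently seen new words earn
        # a stronger decoding bias, capped to avoid runaway rewards.
        return base * min(self.counts.get(word, 0), self.cap)
```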
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions.
With a reasonable prompt, LLMs can leverage their generative capability to correct even tokens that are missing from the N-best list.
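A hedged sketch of how such a benchmark entry might be turned into an LLM correction prompt; the prompt wording is hypothetical and `query_llm` stands in for whatever LLM client is available.
```python
# Sketch: prompt an LLM to correct ASR output from an N-best list,
# in the spirit of the HyPoradise benchmark.
def build_prompt(nbest):
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return ("Below are ASR hypotheses of one utterance, best first.\n"
            f"{hyps}\n"
            "Output the corrected transcription, fixing errors even if the "
            "right token appears in none of the hypotheses.")

nbest = ["the cat sad on the mat", "the cat sat on the hat"]
prompt = build_prompt(nbest)
# corrected = query_llm(prompt)  # hypothetical LLM call
```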
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion for improving automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show 21.4% word error rate improvement compared to a baseline ASR system.
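For intuition, a simplified candidate-retrieval index over character trigrams; SpellMapper learns mappings between misspelled n-grams, rather than the exact matching shown here.
```python
# Sketch: index a custom vocabulary by character trigrams and retrieve
# entries sharing trigrams with a (possibly misspelled) hypothesis fragment.
from collections import defaultdict

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def build_index(vocab):
    index = defaultdict(set)
    for word in vocab:
        for g in ngrams(word):
            index[g].add(word)
    return index

def candidates(fragment, index, min_shared=2):
    hits = defaultdict(int)
    for g in ngrams(fragment):
        for word in index[g]:
            hits[word] += 1
    return [w for w, c in hits.items() if c >= min_shared]
```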
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
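A minimal sketch of a lightweight character encoder for bias terms; the architecture and dimensions are hypothetical, and the paper integrates such representations into a neural transducer rather than using them stand-alone.
```python
# Sketch: encode each bias term from its characters so that terms with
# similar spellings (and hence often similar pronunciations) embed nearby.
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    def __init__(self, n_chars=128, dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids):          # (batch, max_len)
        x = self.embed(char_ids)
        _, h = self.rnn(x)                # final state summarizes the term
        return h.squeeze(0)               # (batch, dim)

enc = CharEncoder()
term = torch.tensor([[ord(c) for c in "nvidia"]])
vec = enc(term)                           # one embedding per bias term
```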
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
- Query Expansion Using Contextual Clue Sampling with Language Models [69.51976926838232]
We propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
Our lexical-matching-based approach achieves similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR.
For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
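A rough sketch of probability-weighted fusion of retrieval runs; the scoring is hypothetical, and the paper's filtering step over lexically similar contexts is omitted.
```python
# Sketch: fuse per-context retrieval runs, weighting each run by the
# generation probability of the context that produced it.
def fuse(runs, keep_top=5):
    # runs: list of (gen_prob, [(doc_id, retrieval_score), ...])
    scores = {}
    for gen_prob, docs in runs:
        for doc_id, s in docs[:keep_top]:
            scores[doc_id] = scores.get(doc_id, 0.0) + gen_prob * s
    return sorted(scores, key=scores.get, reverse=True)
```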
arXiv Detail & Related papers (2022-10-13T15:18:04Z)
- End-to-end contextual asr based on posterior distribution adaptation for hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose to add a contextual bias attention (CBA) module to attention based encoder decoder (AED) model to improve its ability of recognizing the contextual phrases.
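To illustrate the general shape of such a module, a hedged PyTorch sketch of cross-attention from a decoder state over bias-phrase embeddings; this is not the paper's CBA module, and the dimensions are hypothetical.
```python
# Sketch: the decoder state attends over encoded bias phrases, and the
# resulting context vector is injected back into the state.
import torch
import torch.nn as nn

class BiasAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, dec_state, phrase_embs):
        # dec_state: (batch, 1, dim); phrase_embs: (batch, n_phrases, dim)
        ctx, _ = self.attn(dec_state, phrase_embs, phrase_embs)
        return dec_state + ctx            # inject context into decoder state

attn = BiasAttention()
state = torch.randn(1, 1, 256)
phrases = torch.randn(1, 10, 256)         # 10 encoded bias phrases
out = attn(state, phrases)
```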
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
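A toy sketch of the keyword-masking step; the keyword set below is a hypothetical stand-in for the paper's selection procedure.
```python
# Sketch: replace identified keywords with a mask token so the classifier
# must rely on the surrounding context rather than keyword shortcuts.
def mask_keywords(tokens, keywords, mask_token="[MASK]"):
    return [mask_token if t.lower() in keywords else t for t in tokens]

keywords = {"refund", "shipping"}       # hypothetical shortcut words
print(mask_keywords("Where is my refund ?".split(), keywords))
```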
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
- Class LM and word mapping for contextual biasing in End-to-End ASR [4.989480853499918]
In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid interest in the speech recognition community.
In this paper, we propose an algorithm to train a context aware E2E model and allow the beam search to traverse into the context FST during inference.
Although an E2E model does not need a pronunciation dictionary, it is worthwhile to make use of existing pronunciation knowledge to improve accuracy.
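A tiny sketch of the class-tagging half of the idea; the context-FST expansion at inference time is only indicated in comments.
```python
# Sketch: replace contextual entities in training text with a class tag,
# so the LM learns where entities occur; at decode time the tag expands
# into an FST over the user's bias entities.
def tag_classes(sentence, entities, tag="@name"):
    out = sentence
    for e in entities:
        out = out.replace(e, tag)
    return out

print(tag_classes("call alice now", ["alice", "bob"]))
# -> "call @name now"; beam search traverses the entity FST at @name
```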
arXiv Detail & Related papers (2020-07-10T20:58:44Z)