DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2403.17645v3
- Date: Thu, 11 Apr 2024 12:07:33 GMT
- Title: DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition
- Authors: Yi-Cheng Wang, Hsin-Wei Wang, Bi-Cheng Yan, Chi-Han Lin, Berlin Chen
- Abstract summary: We propose a Description Augmented Named entity CorrEctoR (dubbed DANCER) to mitigate phonetic confusion in end-to-end automatic speech recognition (E2E ASR) transcriptions.
DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), with a relative character error rate (CER) reduction of about 7% on named entities in AISHELL-1.
More notably, when tested on the Homophone dataset, which contains named entities of high phonetic confusion, DANCER offers a more pronounced relative CER reduction of 46% over PED-NEC on named entities.
- Score: 10.844822448167935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end automatic speech recognition (E2E ASR) systems often suffer from mistranscription of domain-specific phrases, such as named entities, sometimes leading to catastrophic failures in downstream tasks. A family of fast and lightweight named entity correction (NEC) models for ASR has recently been proposed; these models normally build on phonetic-level edit-distance algorithms and have shown impressive NEC performance. However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we propose a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information that helps mitigate phonetic confusion in NEC on ASR transcriptions. To this end, an efficient entity description augmented masked language model (EDA-MLM), comprising a dense retrieval model, is introduced, enabling the MLM to adapt swiftly to domain-specific entities for the NEC task. A series of experiments conducted on the AISHELL-1 and Homophone datasets confirm the effectiveness of our modeling approach. DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), with a relative character error rate (CER) reduction of about 7% on named entities in AISHELL-1. More notably, when tested on the Homophone dataset, which contains named entities of high phonetic confusion, DANCER offers a more pronounced relative CER reduction of 46% over PED-NEC on named entities.
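To make the two-stage idea concrete, here is a minimal, hypothetical sketch of description-augmented NEC: phonetically similar candidates are retrieved first (the PED-style step), and entity descriptions then disambiguate homophones against the ASR context. All names (ENTITY_LIST, correct_entity) are illustrative; SequenceMatcher stands in for a phoneme-level edit distance, and crude word overlap stands in for the paper's dense-retrieval EDA-MLM scoring.

```python
# Hedged sketch of description-augmented named entity correction (NEC).
# DANCER's actual EDA-MLM couples a dense retriever with a masked LM;
# bag-of-words overlap stands in for that step here.
from difflib import SequenceMatcher

# Hypothetical NE list: surface form -> short description.
ENTITY_LIST = {
    "Jon Snow": "fictional character in A Song of Ice and Fire",
    "John Snow": "English physician who founded modern epidemiology",
}

def phonetic_similarity(a: str, b: str) -> float:
    # Stand-in for a phoneme-level edit distance (e.g., over pinyin or IPA).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def description_score(description: str, context: str) -> float:
    # Stand-in for EDA-MLM scoring: word overlap between the entity
    # description and the surrounding ASR context.
    d, c = set(description.lower().split()), set(context.lower().split())
    return len(d & c) / max(len(d), 1)

def correct_entity(asr_span: str, context: str, k: int = 5) -> str:
    # Step 1: retrieve the k phonetically closest entities (PED-style).
    candidates = sorted(ENTITY_LIST, key=lambda e: -phonetic_similarity(asr_span, e))[:k]
    # Step 2: break phonetic ties with description evidence from the context.
    return max(candidates, key=lambda e: (description_score(ENTITY_LIST[e], context),
                                          phonetic_similarity(asr_span, e)))

print(correct_entity("john snow", "the physician traced the cholera outbreak"))
# -> "John Snow": phonetics alone cannot separate the homophones,
#    but the description matches the medical context.
```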
Related papers
- Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in both in-domain (ID) and out-of-domain (OOD) scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z)
- LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR [3.841280537264271]
We propose a light on-the-fly method to improve automatic speech recognition performance.
We combine a bias list of named entities with a word-level n-gram language model through shallow fusion, using the Aho-Corasick string-matching algorithm.
We achieve up to 21.6% relative improvement in the general word error rate with no practical difference in the inverse real-time factor.
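As a rough illustration of the matching machinery (not the paper's implementation), the sketch below builds an Aho-Corasick automaton over a toy bias list and adds a shallow-fusion-style bonus per matched entity; the word-level n-gram LM component is omitted.

```python
# Minimal Aho-Corasick automaton for matching a bias list of named entities
# in a hypothesis; in shallow fusion, each match adds a score bonus to the
# beam-search hypothesis. Sketch only; the n-gram LM term is omitted.
from collections import deque

def build_automaton(phrases):
    goto, fail, out = [{}], [0], [set()]
    for p in phrases:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())
    while q:  # BFS to set failure links
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            nxt = goto[f].get(ch, 0)
            fail[t] = nxt if nxt != t else 0
            out[t] |= out[fail[t]]
    return goto, fail, out

def biased_score(text, automaton, base_score, bonus=2.0):
    goto, fail, out = automaton
    s, matches = 0, 0
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        matches += len(out[s])  # bias phrases ending at this position
    return base_score + bonus * matches  # shallow-fusion style reward

ac = build_automaton(["aho corasick", "asr"])
print(biased_score("fast asr with aho corasick matching", ac, base_score=-12.3))
```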
arXiv Detail & Related papers (2024-09-20T13:53:37Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome the limitation that GER operates on text hypotheses alone by infusing acoustic information before generating the predicted transcription, through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
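A minimal sketch of the late-fusion idea as we read it (the exact UADF formulation is in the paper): interpolate the LLM and acoustic token distributions, shifting weight toward the acoustic side when the LLM's token entropy is high.

```python
# Hedged sketch of uncertainty-aware late fusion: blend the LLM's next-token
# distribution with the ASR (acoustic) distribution, leaning on the acoustic
# side more when the LLM is uncertain (high entropy).
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def fuse(p_llm, p_asr, max_entropy):
    # Dynamic weight: a confident LLM (low entropy) dominates; otherwise
    # probability mass shifts toward the acoustic distribution.
    w = 1.0 - min(entropy(p_llm) / max_entropy, 1.0)
    fused = [w * l + (1.0 - w) * a for l, a in zip(p_llm, p_asr)]
    z = sum(fused)
    return [x / z for x in fused]

vocab = ["flour", "flower", "floor"]
p_llm = [0.34, 0.33, 0.33]   # LLM nearly uniform: homophones confuse it
p_asr = [0.70, 0.20, 0.10]   # acoustics clearly prefer "flour"
p = fuse(p_llm, p_asr, max_entropy=math.log(len(vocab)))
print(vocab[max(range(len(p)), key=p.__getitem__)], [round(x, 3) for x in p])
```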
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Generative error correction for code-switching speech recognition using large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR system to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
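A toy illustration of N-best prompting in this spirit; the prompt wording is ours, not the benchmark's official template.

```python
# Illustrative construction of an N-best correction prompt for LLM-based
# generative error correction; the phrasing is our own assumption.
def build_prompt(nbest):
    lines = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return (
        "The following are N-best hypotheses from a speech recognizer for one "
        "utterance. Output the most likely true transcription; you may produce "
        "words that appear in none of the hypotheses if the context demands it.\n"
        f"{lines}\nTranscription:"
    )

nbest = [
    "he bought a new ford focus",
    "he bought a new four focus",
    "he brought a new ford focus",
]
print(build_prompt(nbest))
# The prompt would then be sent to an LLM; cross-hypothesis evidence lets it
# recover tokens that individual hypotheses garble.
```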
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer [50.572974726351504]
We propose C-FNT, a novel E2E model that incorporates class-based LMs into the factorized neural transducer (FNT).
In C-FNT, the LM score of named entities can be associated with the name class instead of its surface form.
The experimental results show that our proposed C-FNT significantly reduces error in named entities without hurting performance in general word recognition.
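A hedged sketch of the class-based factorization: an entity phrase is scored as p(<NAME> | history) x p(entity | <NAME>), so any name on a pluggable list inherits the class score without LM retraining. All numbers and names below are made up.

```python
# Hedged sketch of class-based LM scoring in the spirit of C-FNT: a named
# entity is scored through its class token, so rare surface forms inherit
# the class probability. Probabilities are fabricated for illustration.
import math

# p(next token | history) from a vocabulary LM with a special <NAME> class.
P_TOKEN = {"called": 0.02, "<NAME>": 0.10, "the": 0.15}
# p(entity | class): uniform over a pluggable, external name list.
NAME_LIST = ["Aliya", "Haruto", "Xiaoming"]

def log_p_entity(entity: str) -> float:
    # log p(<NAME> | history) + log p(entity | <NAME>)
    if entity not in NAME_LIST:
        return float("-inf")
    return math.log(P_TOKEN["<NAME>"]) + math.log(1.0 / len(NAME_LIST))

print(f"{log_p_entity('Aliya'):.3f}")     # same score for every listed name,
print(f"{log_p_entity('Xiaoming'):.3f}")  # so new entities need no LM retraining
```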
arXiv Detail & Related papers (2023-09-14T12:14:49Z)
- ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER [47.32935969127478]
We present ACLM, an approach built on attention-map aware keyword selection for conditional language model fine-tuning.
ACLM alleviates the context-entity mismatch issue, a problem existing NER data augmentation techniques suffer from.
We demonstrate the effectiveness of ACLM both qualitatively and quantitatively on monolingual, cross-lingual, and multilingual complex NER.
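One plausible reading of the selective-denoising step, sketched with fabricated attention scores: keep entity tokens plus the highest-attention keywords, mask the rest, and hand the template to a fine-tuned conditional LM (omitted here) for regeneration.

```python
# Hedged sketch of ACLM-style selective denoising: keep named-entity tokens
# plus top-attention keywords, mask everything else, and let a conditional
# LM regenerate a fresh augmented sentence. Attention values are stand-ins.
def build_template(tokens, entity_idx, attn, keep_ratio=0.3, mask="<mask>"):
    n_keep = max(1, int(keep_ratio * len(tokens)))
    # Highest-attention non-entity tokens act as context "keywords".
    keywords = set(sorted((i for i in range(len(tokens)) if i not in entity_idx),
                          key=lambda i: -attn[i])[:n_keep])
    return " ".join(tok if i in set(entity_idx) | keywords else mask
                    for i, tok in enumerate(tokens))

tokens = "the patient received remdesivir at boston general yesterday".split()
entity_idx = [3, 5, 6]                      # remdesivir, boston, general
attn = [0.1, 0.7, 0.6, 0.9, 0.2, 0.8, 0.8, 0.3]
print(build_template(tokens, entity_idx, attn))
# -> "<mask> patient received remdesivir <mask> boston general <mask>"
# A conditional LM then fills the masks to yield a new augmented example.
```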
arXiv Detail & Related papers (2023-06-01T17:33:04Z)
- Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation [0.742779257315787]
This paper proposes a novel retraining-free customization method for E2E ASR, based on a named-entity-aware E2E-ASR model and phoneme similarity estimation.
Experimental results show that the proposed method improves the target NE character error rate by 35.7% on average relative to the conventional E2E-ASR model.
arXiv Detail & Related papers (2023-05-29T02:10:13Z)
- A Discriminative Entity-Aware Language Model for Virtual Assistants [4.2854663014000876]
High-quality automatic speech recognition (ASR) is essential for virtual assistants (VAs) to work well.
In this work, we start from the observation that many ASR errors on named entities are inconsistent with real-world knowledge.
We extend previous discriminative n-gram language modeling approaches to incorporate real-world knowledge from a Knowledge Graph.
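A toy sketch of the rescoring idea: each hypothesis gets a weighted sum of an LM feature and a knowledge-graph feature that checks whether its entities exist; the paper learns such weights discriminatively, whereas they are hand-set here.

```python
# Hedged sketch of entity-aware discriminative rescoring: score hypotheses
# with a weighted sum of an LM feature and a knowledge-graph feature that
# flags entities inconsistent with real-world knowledge.
KNOWLEDGE_GRAPH = {"beyonce", "boston", "bolton"}   # toy entity inventory

def kg_feature(entities):
    # Fraction of hypothesized entities confirmed by the knowledge graph.
    return sum(e in KNOWLEDGE_GRAPH for e in entities) / max(len(entities), 1)

def rescore(hypotheses, w_lm=1.0, w_kg=3.0):
    return max(hypotheses,
               key=lambda h: w_lm * h["lm_score"] + w_kg * kg_feature(h["entities"]))

hyps = [
    {"text": "play beyonce in bolton", "lm_score": -4.1,
     "entities": ["beyonce", "bolton"]},
    {"text": "play beyond sea in boston", "lm_score": -3.9,
     "entities": ["beyond sea", "boston"]},
]
print(rescore(hyps)["text"])  # KG evidence rescues the entity-correct hypothesis
```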
arXiv Detail & Related papers (2021-06-21T17:50:28Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
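A schematic PyTorch sketch of such a joint model (sizes and architecture are illustrative, not the paper's): a shared encoder over ASR output feeds both a token-level correction head and an utterance-level LU head; training would sum the two losses so both tasks share the encoder.

```python
# Hedged sketch of a shared-encoder multi-task model: one text encoder over
# ASR output feeds a correction head (per-token rewrite) and a language
# understanding head (intent classification). Requires PyTorch.
import torch
import torch.nn as nn

class JointCorrectionLU(nn.Module):
    def __init__(self, vocab=1000, dim=128, n_intents=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.correction_head = nn.Linear(2 * dim, vocab)   # per-token rewrite
        self.intent_head = nn.Linear(2 * dim, n_intents)   # utterance-level LU

    def forward(self, asr_tokens):
        h, _ = self.encoder(self.embed(asr_tokens))
        corrected_logits = self.correction_head(h)          # (B, T, vocab)
        intent_logits = self.intent_head(h.mean(dim=1))     # (B, n_intents)
        return corrected_logits, intent_logits

model = JointCorrectionLU()
tokens = torch.randint(0, 1000, (2, 12))                    # fake ASR token ids
corr, intent = model(tokens)
print(corr.shape, intent.shape)  # torch.Size([2, 12, 1000]) torch.Size([2, 8])
```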
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.