Improving Named Entity Transcription with Contextual LLM-based Revision
- URL: http://arxiv.org/abs/2506.10779v1
- Date: Thu, 12 Jun 2025 14:53:48 GMT
- Title: Improving Named Entity Transcription with Contextual LLM-based Revision
- Authors: Viet Anh Trinh, Xinlu He, Jacob Whitehill
- Abstract summary: We introduce a large language model (LLM) revision mechanism to revise incorrect named entities in automatic speech recognition predictions. Our proposed technique achieves up to 30% relative WER reduction for named entities.
- Score: 14.078146578977599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM's reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.
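The abstract does not include code; as a rough illustration of the revision step it describes, the sketch below prompts a generic LLM with the ASR hypothesis plus a list of correct entities drawn from local context such as lecture notes. The prompt wording, the `llm` callable, and the `revise_entities` helper are all hypothetical assumptions, not the paper's actual implementation.

```python
# Minimal sketch of contextual LLM-based named-entity revision.
# Illustrative only: the prompt text and the llm() interface are
# assumptions, not taken from the paper.

from typing import Callable, List

PROMPT_TEMPLATE = (
    "The transcript below may contain misrecognized named entities.\n"
    "Correct entities from the lecture notes: {entities}\n"
    "Transcript: {hypothesis}\n\n"
    "Rewrite the transcript, replacing only misrecognized entities with the "
    "closest-sounding entity from the list. Leave all other words unchanged."
)

def revise_entities(
    hypothesis: str,
    context_entities: List[str],
    llm: Callable[[str], str],
) -> str:
    """Revise an ASR hypothesis using local context (e.g., lecture notes)
    that contains the set of correct named entities."""
    prompt = PROMPT_TEMPLATE.format(
        entities=", ".join(context_entities),
        hypothesis=hypothesis,
    )
    return llm(prompt)

# Usage with any text-completion backend:
#   revised = revise_entities("a theorem by four year",
#                             ["Fourier", "Parseval"],
#                             llm=my_model.generate)
```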
Related papers
- ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark [28.28891500803133]
We propose ContextASR-Bench to assess the linguistic competence of Automatic Speech Recognition systems. It encompasses up to 40,000 data entries with more than 300,000 named entities across over 10 domains. Extensive evaluation shows that large audio language models (LALMs) outperform conventional ASR models by a large margin, thanks to the strong world knowledge and context modeling of LLMs.
arXiv Detail & Related papers (2025-07-08T07:21:20Z)
- Customizing Speech Recognition Model with Large Language Model Feedback [5.290365603660415]
We propose a reinforcement learning based approach for unsupervised domain adaptation. We leverage unlabeled data to enhance transcription quality, particularly for the named entities affected by domain mismatch. Our method achieves a 21% improvement in entity word error rate over conventional self-training methods.
arXiv Detail & Related papers (2025-06-05T18:42:57Z)
- LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context [4.444835399672951]
We propose a novel generative error correction (GER) approach that targets rare words and incorporates phonetic information. Experimental results show that our method not only improves the correction of rare words but also reduces the WER and character error rate (CER).
arXiv Detail & Related papers (2025-05-23T02:54:52Z)
- Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration [0.8702432681310401]
We investigate the integration of a large language model (LLM) with an automatic speech recognition (ASR) system. Our analysis reveals that the LLM contributes significantly to improvements in rare word error rate (R-WER). Through extensive ablation studies, we highlight the importance of adapter integration in aligning speech encoder outputs with the LLM's linguistic capabilities.
arXiv Detail & Related papers (2025-02-22T08:30:38Z)
- "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities [59.22329574700317]
Spoken named entity recognition (NER) aims to identify named entities from speech. New named entities appear every day; however, annotating Spoken NER data for them is costly. We propose a method for generating Spoken NER data based on a named entity dictionary (NED) to reduce costs.
arXiv Detail & Related papers (2024-12-26T07:43:18Z)
- Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction (GEC) models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in both in-domain (ID) and out-of-domain (OOD) scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach for Automatic Speech Recognition. We use a memory-enhanced ASR model from the literature to decode new words from the slides. We show that, with this approach, performance on the new words increases as they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER [47.32935969127478]
We present ACLM (Attention-map aware keyword selection for Conditional Language Model fine-tuning).
ACLM alleviates the context-entity mismatch issue, a problem existing NER data augmentation techniques suffer from.
We demonstrate the effectiveness of ACLM both qualitatively and quantitatively on monolingual, cross-lingual, and multilingual complex NER.
arXiv Detail & Related papers (2023-06-01T17:33:04Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex publicly available SLU dataset.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Contextual RNN-T For Open Domain ASR [41.83409885125617]
End-to-end (E2E) systems for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system into a single neural network.
While this has some nice advantages, it limits the system to be trained using only paired audio and text.
Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names.
We propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words.
arXiv Detail & Related papers (2020-06-04T04:37:03Z)
- Interpretability Analysis for Named Entity Recognition to Understand System Predictions and How They Can Improve [49.878051587667244]
We examine the performance of several variants of LSTM-CRF architectures for named entity recognition.
We find that context representations do contribute to system performance, but that the main factor driving high performance is learning the name tokens themselves.
We enlist human annotators to evaluate the feasibility of inferring entity types from the context alone and find that, although humans also fail to infer the entity type for the majority of the errors made by the context-only system, there is some room for improvement.
arXiv Detail & Related papers (2020-04-09T14:37:12Z)