CopyNE: Better Contextual ASR by Copying Named Entities
- URL: http://arxiv.org/abs/2305.12839v2
- Date: Mon, 27 May 2024 06:35:36 GMT
- Title: CopyNE: Better Contextual ASR by Copying Named Entities
- Authors: Shilin Zhou, Zhenghua Li, Yu Hong, Min Zhang, Zhefeng Wang, Baoxing Huai
- Abstract summary: We design a systematic mechanism called CopyNE, which can copy entities from the NE dictionary.
Experiments demonstrate that CopyNE consistently improves the accuracy of transcribing entities compared to previous approaches.
- Score: 35.36208545538822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end automatic speech recognition (ASR) systems have made significant progress in general scenarios. However, it remains challenging to transcribe contextual named entities (NEs) in the contextual ASR scenario. Previous approaches have attempted to address this by utilizing the NE dictionary. These approaches treat entities as individual tokens and generate them token-by-token, which may result in incomplete transcriptions of entities. In this paper, we treat entities as indivisible wholes and introduce the idea of copying into ASR. We design a systematic mechanism called CopyNE, which can copy entities from the NE dictionary. By copying all tokens of an entity at once, we can reduce errors during entity transcription, ensuring the completeness of the entity. Experiments demonstrate that CopyNE consistently improves the accuracy of transcribing entities compared to previous approaches. Even when based on the strong Whisper, CopyNE still achieves notable improvements.
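The copy-versus-generate idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the function name, scoring scheme, and dictionary format are all hypothetical. The point it shows is that when the decoder chooses to copy, it emits every token of the entity in one step, so the entity cannot be transcribed partially.

```python
# Illustrative sketch of copy-vs-generate decoding (hypothetical names/scores;
# not CopyNE's real architecture). At each step the decoder either generates
# one ordinary vocabulary token or copies a whole entity from the NE dictionary.

def decode_step(vocab_scores, entity_scores, ne_dict):
    """Return the list of tokens to append at this decoding step.

    vocab_scores:  {token: score} over the ASR vocabulary
    entity_scores: {entity: score} over entries of the NE dictionary
    ne_dict:       {entity: [token, ...]} mapping each entity to its tokens
    """
    best_token, token_score = max(vocab_scores.items(), key=lambda kv: kv[1])
    best_entity, entity_score = max(entity_scores.items(), key=lambda kv: kv[1])
    if entity_score > token_score:
        # Copy: emit all tokens of the entity at once, ensuring completeness.
        return ne_dict[best_entity]
    # Generate: emit a single vocabulary token, as a standard decoder would.
    return [best_token]

# Example: the entity score for "New York" beats every single-token score,
# so both of its tokens are emitted in one step.
ne_dict = {"New York": ["New", "York"]}
hyp = decode_step({"the": 0.6, "New": 0.3}, {"New York": 0.8}, ne_dict)
```

In a token-by-token baseline, the decoder might emit "New" and then drift to a more frequent continuation; copying the entity as an indivisible whole removes that failure mode.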
Related papers
- Contextualized Token Discrimination for Speech Search Query Correction [14.096535124540354]
This paper introduces a novel method named Contextualized Token Discrimination (CTD) to conduct effective speech query correction.
In CTD, we first employ BERT to generate token-level contextualized representations and then construct a composition layer to enhance semantic information.
We produce the correct query according to the aggregated token representation, correcting the incorrect tokens by comparing the original token representations and the contextualized representations.
arXiv Detail & Related papers (2025-09-04T17:04:44Z) - PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation [35.774826781541385]
We propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO).
PARCO integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering.
Experiments show that PARCO achieves a CER of 4.22% on Chinese AISHELL-1 and a WER of 11.14% on English DATA2.
arXiv Detail & Related papers (2025-09-04T16:18:34Z) - Generative Annotation for ASR Named Entity Correction [22.96005224780927]
End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities.
We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities.
We test our method using open-source and self-constructed test sets.
arXiv Detail & Related papers (2025-08-28T12:18:35Z) - Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions [5.439020425819001]
We propose a novel training approach that extends the semantic context of ASR models.
By sliding 5-second overlaps on both sides of 30-second chunks, we create a 40-second "effective semantic window".
We evaluate our method on the Spoken Wikipedia dataset.
arXiv Detail & Related papers (2025-06-28T11:41:36Z) - "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities [59.22329574700317]
Spoken named entity recognition (NER) aims to identify named entities from speech.
New named entities appear every day; however, annotating Spoken NER data for them is costly.
We propose a method for generating Spoken NER data based on a named entity dictionary (NED) to reduce costs.
arXiv Detail & Related papers (2024-12-26T07:43:18Z) - ReverseNER: A Self-Generated Example-Driven Framework for Zero-Shot Named Entity Recognition with Large Language Models [0.0]
We present ReverseNER, a framework aimed at overcoming the limitations of large language models (LLMs) in zero-shot Named Entity Recognition tasks.
Rather than beginning with sentences, this method uses an LLM to generate entities based on their definitions and then expands them into full sentences.
This results in well-annotated sentences with clearly labeled entities, while preserving semantic and structural similarity to the task sentences.
arXiv Detail & Related papers (2024-11-01T12:08:08Z) - Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices [8.77712061194924]
We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models.
Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations.
We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
arXiv Detail & Related papers (2024-09-24T21:42:25Z) - SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z) - Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts [44.16141704545044]
We present a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data.
The proposed algorithm improves the robustness and accuracy of ASR systems, particularly when working with imprecisely transcribed speech corpora.
arXiv Detail & Related papers (2023-06-01T14:56:19Z) - Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z) - Autoregressive Entity Retrieval [55.38027440347138]
Entities are at the center of how we represent and aggregate knowledge.
The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering.
We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion.
arXiv Detail & Related papers (2020-10-02T10:13:31Z) - Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z) - Non-Autoregressive Machine Translation with Disentangled Context Transformer [70.95181466892795]
State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens.
We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts.
Our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.
arXiv Detail & Related papers (2020-01-15T05:32:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.