Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions
- URL: http://arxiv.org/abs/2506.22858v1
- Date: Sat, 28 Jun 2025 11:41:36 GMT
- Title: Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions
- Authors: Duygu Altinok
- Abstract summary: We propose a novel training approach that extends the semantic context of ASR models. By sliding 5-second overlaps on both sides of 30-second chunks, we create a 40-second "effective semantic window". We evaluate our method on the Spoken Wikipedia dataset.
- Score: 5.439020425819001
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems, such as Whisper, achieve high transcription accuracy but struggle with named entities and numerical data, especially when proper formatting is required. These issues increase word error rate (WER) and impair semantic understanding in critical domains like legal, financial, and medical applications. We propose a novel training approach that extends the semantic context of ASR models by adding overlapping context windows during training. By sliding 5-second overlaps on both sides of 30-second chunks, we create a 40-second "effective semantic window," improving entity recognition and formatting while focusing predictions on the central 30 seconds. To address entities spanning chunk boundaries, we reassign such entities entirely to the right-hand chunk, ensuring proper formatting. Additionally, enriched training data with embedded entity labels enables the model to learn both recognition and type-specific formatting. Evaluated on the Spoken Wikipedia dataset, our method improves performance across semantic tasks, including named entity recognition (NER) and entity formatting. These results highlight the effectiveness of context-aware training in addressing ASR limitations for long-form transcription and complex entity recognition tasks.
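The chunking scheme the abstract describes can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code; all function names (`make_context_windows`, `assign_chunk`) are hypothetical.

```python
def make_context_windows(total_sec, chunk_sec=30.0, overlap_sec=5.0):
    """Yield (window_start, window_end, core_start, core_end) tuples.

    Each 30-second core chunk is padded with 5 seconds of context on
    both sides, giving the 40-second "effective semantic window" from
    the abstract. The model sees [window_start, window_end] but its
    predictions are scored only on the central core.
    """
    windows = []
    t = 0.0
    while t < total_sec:
        core_start = t
        core_end = min(t + chunk_sec, total_sec)
        window_start = max(0.0, core_start - overlap_sec)
        window_end = min(total_sec, core_end + overlap_sec)
        windows.append((window_start, window_end, core_start, core_end))
        t += chunk_sec
    return windows


def assign_chunk(entity_start, entity_end, boundary):
    """Boundary rule from the abstract: an entity that spans a chunk
    boundary is reassigned entirely to the right-hand chunk, so its
    formatting is never split across two predictions."""
    if entity_start < boundary <= entity_end:
        return "right"
    return "right" if entity_start >= boundary else "left"
```

For a 90-second recording this yields three windows, e.g. the middle one covers 25-65 s of audio while predicting only the 30-60 s core, and an entity spanning the 30 s boundary is emitted wholly by the right-hand chunk.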
Related papers
- Consistency-Aware Editing for Entity-level Unlearning in Language Models [53.522931419965424]
We introduce a novel consistency-aware editing (CAE) framework for entity-level unlearning. CAE aggregates a diverse set of prompts related to a target entity, including its attributes, relations, and adversarial paraphrases. It then jointly learns a low-rank update guided by a consistency regularizer that aligns the editing directions across prompts.
arXiv Detail & Related papers (2025-12-19T15:18:07Z) - Generative Annotation for ASR Named Entity Correction [22.96005224780927]
End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. We test our method using open-source and self-constructed test sets.
arXiv Detail & Related papers (2025-08-28T12:18:35Z) - Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts [5.439020425819001]
We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA.
arXiv Detail & Related papers (2025-08-18T21:37:09Z) - Improving Named Entity Transcription with Contextual LLM-based Revision [14.078146578977599]
We introduce a large language model (LLM) revision mechanism to revise incorrect named entities in automatic speech recognition predictions. Our proposed technique achieves up to 30% relative WER reduction for named entities.
arXiv Detail & Related papers (2025-06-12T14:53:48Z) - Customizing Speech Recognition Model with Large Language Model Feedback [5.290365603660415]
We propose a reinforcement learning based approach for unsupervised domain adaptation. We leverage unlabeled data to enhance transcription quality, particularly for named entities affected by domain mismatch. Our method achieves a 21% improvement in entity word error rate over conventional self-training methods.
arXiv Detail & Related papers (2025-06-05T18:42:57Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts [44.16141704545044]
We present a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data.
The proposed algorithm improves the robustness and accuracy of ASR systems, particularly when working with imprecisely transcribed speech corpora.
arXiv Detail & Related papers (2023-06-01T14:56:19Z) - Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation [80.12316877964558]
High-quality data labeling for specific domains is costly and time-consuming for human annotators.
We propose a self-supervised domain adaptation method, based upon an iterative pseudo-forced alignment algorithm.
arXiv Detail & Related papers (2022-10-27T07:23:08Z) - Cross-domain Speech Recognition with Unsupervised Character-level Distribution Matching [60.8427677151492]
We propose CMatch, a Character-level distribution matching method to perform fine-grained adaptation between each character in two domains.
Experiments on the Libri-Adapt dataset show that our proposed approach achieves 14.39% and 16.50% relative Word Error Rate (WER) reduction on cross-device and cross-environment ASR, respectively.
arXiv Detail & Related papers (2021-04-15T14:36:54Z) - Autoregressive Entity Retrieval [55.38027440347138]
Entities are at the center of how we represent and aggregate knowledge.
The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering.
We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion.
arXiv Detail & Related papers (2020-10-02T10:13:31Z) - End-to-End Spoken Language Understanding Without Full Transcripts [38.19173637496798]
We develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities.
We create two types of such speech-to-entities models, a CTC model and an attention-based encoder-decoder model.
For our speech-to-entities experiments on the ATIS corpus, both the CTC and attention models showed impressive ability to skip non-entity words.
arXiv Detail & Related papers (2020-09-30T01:54:13Z) - ConCET: Entity-Aware Topic Classification for Open-Domain Conversational Agents [9.870634472479571]
We introduce ConCET: a Concurrent Entity-aware conversational Topic classifier.
We propose a simple and effective method for generating synthetic training data.
We evaluate ConCET on a large dataset of human-machine conversations with real users, collected as part of the Amazon Alexa Prize.
arXiv Detail & Related papers (2020-05-28T06:29:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.