PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation
- URL: http://arxiv.org/abs/2509.04357v1
- Date: Thu, 04 Sep 2025 16:18:34 GMT
- Title: PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation
- Authors: Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda
- Abstract summary: We propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. Experiments show that PARCO achieves a CER of 4.22% on Chinese AISHELL-1 and a WER of 11.14% on English DATA2.
- Score: 35.774826781541385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech.
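The abstract names contrastive entity disambiguation as a core component but does not spell it out here. As a hedged illustration only, a common way to realize such a component is an InfoNCE-style loss that pulls an audio-derived entity representation toward the embedding of the correct entity and pushes it away from distractor entities; the sketch below assumes plain-list embeddings and is not the paper's actual implementation:

```python
import math

def info_nce_loss(query, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss over cosine similarities.

    `query` is a (hypothetical) audio-derived entity representation,
    `positive` the embedding of the correct entity, and `negatives`
    the embeddings of distractor entities.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Positive similarity first, then all distractors.
    logits = [cos(query, positive) / temperature] + [
        cos(query, n) / temperature for n in negatives
    ]
    # Numerically stable softmax cross-entropy with the positive at index 0.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

The loss is near zero when the query matches the correct entity and grows when a distractor is more similar, which is the behavior a disambiguation objective needs under many phonetically confusable candidates.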
Related papers
- Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction [12.483998165719981]
We propose a retrieval-augmented generation framework for correcting named entity errors in automatic speech recognition (ASR). Our approach consists of two key components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; and (2) a novel self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts the depth of its reasoning based on task difficulty.
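The phonetic-level edit distance retrieval this summary mentions can be sketched with a standard Levenshtein distance computed over phoneme sequences rather than characters; the lexicon entries and phoneme tokens below are hypothetical examples, not data from the paper:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences
    (here, phoneme lists), using a rolling one-row DP table."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            cur = min(dp[j] + 1,          # deletion
                      dp[j - 1] + 1,      # insertion
                      prev + (pa != pb))  # substitution / match
            prev, dp[j] = dp[j], cur
    return dp[-1]

def retrieve_candidates(hyp_phonemes, lexicon, top_k=3):
    """Rank dictionary entities by phoneme-level edit distance
    to the phonemized ASR hypothesis span."""
    scored = sorted(lexicon.items(),
                    key=lambda kv: edit_distance(hyp_phonemes, kv[1]))
    return [name for name, _ in scored[:top_k]]
```

Comparing at the phoneme level is what lets such retrieval recover homophones and near-homophones that character-level matching would miss.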
arXiv Detail & Related papers (2026-01-21T15:05:39Z) - Index-MSR: A high-efficiency multimodal fusion framework for speech recognition [7.677016652056559]
Index-MSR is an efficient multimodal speech recognition framework. Its MFD effectively incorporates text-related information from videos into speech recognition. We show that Index-MSR achieves state-of-the-art accuracy, with substitution errors reduced by 20.50%.
arXiv Detail & Related papers (2025-09-26T03:47:15Z) - Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts [5.439020425819001]
We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA.
arXiv Detail & Related papers (2025-08-18T21:37:09Z) - Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z) - DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition [10.844822448167935]
We propose a Description Augmented Named entity CorrEctoR (dubbed DANCER) to mitigate phonetic confusion in end-to-end automatic speech recognition (E2E ASR) transcriptions.
DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), with a relative character error rate (CER) reduction of about 7% on AISHELL-1 for named entities.
More notably, when tested on Homophone, a set containing named entities of high phonetic confusion, DANCER offers a more pronounced relative CER reduction of 46% over PED-NEC for named entities.
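The CER figures quoted throughout this list share one definition, and the "relative reduction" comparisons are simple arithmetic on top of it; a minimal sketch (the example strings and rates below are illustrative, not taken from the papers):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance between the two
    character sequences, divided by the reference length."""
    dp = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, hc in enumerate(hyp, 1):
            cur = min(dp[j] + 1,          # deletion
                      dp[j - 1] + 1,      # insertion
                      prev + (rc != hc))  # substitution / match
            prev, dp[j] = dp[j], cur
    return dp[-1] / len(ref)

def relative_reduction(baseline, system):
    """Relative error-rate reduction, as used in comparisons such as
    'a CER reduction of about 7% relatively over the baseline'."""
    return (baseline - system) / baseline
```

Note that a "46% relative reduction" is a ratio of error rates, not a 46-point absolute drop: a baseline CER of 10% reduced relatively by 46% lands at 5.4%.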
arXiv Detail & Related papers (2024-03-26T12:27:32Z) - mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z) - CopyNE: Better Contextual ASR by Copying Named Entities [35.36208545538822]
We design a systematic mechanism called CopyNE, which can copy entities from the NE dictionary.
Experiments demonstrate that CopyNE consistently improves the accuracy of transcribing entities compared to previous approaches.
arXiv Detail & Related papers (2023-05-22T09:03:11Z) - BERM: Training the Balanced and Extractable Representation for Matching to Improve Generalization Ability of Dense Retrieval [54.66399120084227]
Dense retrieval has shown promise in the first-stage retrieval process when trained on in-domain labeled datasets.
We propose a novel method, BERM, to improve the generalization of dense retrieval by capturing matching signals.
arXiv Detail & Related papers (2023-05-18T15:43:09Z) - Probing Linguistic Features of Sentence-Level Representations in Neural Relation Extraction [80.38130122127882]
We introduce 14 probing tasks targeting linguistic properties relevant to neural relation extraction (RE).
We use them to study representations learned by more than 40 different combinations of encoder architectures and linguistic features trained on two datasets.
We find that the biases induced by the architecture and by the inclusion of linguistic features are clearly expressed in the probing task performance.
arXiv Detail & Related papers (2020-04-17T09:17:40Z) - Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.