Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation
- URL: http://arxiv.org/abs/2508.17796v2
- Date: Tue, 02 Sep 2025 02:20:04 GMT
- Authors: Changsong Liu, Yizhou Peng, Eng Siong Chng
- Abstract summary: We propose a synthesis-driven multi-pronunciation contextual biasing method. Our method reduces biased-word error rate (B-WER) by 43% on test-clean and 44% on test-other while maintaining unbiased-WER (U-WER) essentially unchanged.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual automatic speech recognition (ASR) systems allow for recognizing out-of-vocabulary (OOV) words, such as named entities or rare words. However, it remains challenging due to limited training data and ambiguous or inconsistent pronunciations. In this paper, we propose a synthesis-driven multi-pronunciation contextual biasing method that performs zero-shot contextual ASR on a pretrained Whisper model. Specifically, we leverage text-to-speech (TTS) systems to synthesize diverse speech samples containing each target rare word, and then use the pretrained Whisper model to extract multiple predicted pronunciation variants. These variant token sequences are compiled into a prefix-trie, which assigns rewards to beam hypotheses in a shallow-fusion manner during beam-search decoding. Subsequently, any recognized variant is mapped back to the original rare word in the final transcription. The evaluation results on the LibriSpeech dataset show that our method reduces biased-word error rate (B-WER) by 43% on test-clean and 44% on test-other while maintaining unbiased-WER (U-WER) essentially unchanged.
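The prefix-trie mechanism the abstract describes can be sketched in a few lines: variant token sequences for each rare word are inserted into a trie, hypotheses earn a shallow-fusion bonus while their tokens follow a trie path, and any completed variant is mapped back to the canonical spelling. This is a minimal illustrative sketch, not the paper's implementation; the class names, the use of plain string tokens instead of Whisper subword IDs, and the reward value are all assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.canonical = None  # set at the end of a complete variant sequence


class BiasTrie:
    def __init__(self, reward=2.0):
        self.root = TrieNode()
        self.reward = reward  # per-token shallow-fusion bonus (assumed value)

    def add_variant(self, tokens, canonical):
        """Insert one predicted pronunciation variant for a rare word."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
        node.canonical = canonical

    def score(self, tokens):
        """Total bonus for a hypothesis: each token that extends a trie path
        earns `reward`; tokens off every path reset matching to the root."""
        node, bonus = self.root, 0.0
        for t in tokens:
            nxt = node.children.get(t) or self.root.children.get(t)
            if nxt is None:
                node = self.root
            else:
                node, bonus = nxt, bonus + self.reward
        return bonus

    def map_back(self, tokens):
        """Replace any completed variant sequence with its canonical word."""
        out, i = [], 0
        while i < len(tokens):
            node, j, hit = self.root, i, None
            while j < len(tokens) and tokens[j] in node.children:
                node = node.children[tokens[j]]
                j += 1
                if node.canonical is not None:
                    hit = (j, node.canonical)  # longest match so far
            if hit:
                out.append(hit[1])
                i = hit[0]
            else:
                out.append(tokens[i])
                i += 1
        return out


trie = BiasTrie(reward=2.0)
trie.add_variant(["ko", "vid"], "COVID")  # hypothetical Whisper misspelling
trie.add_variant(["co", "vid"], "COVID")
print(trie.score(["the", "ko", "vid"]))               # 4.0
print(trie.map_back(["the", "ko", "vid", "era"]))     # ['the', 'COVID', 'era']
```

In a real decoder the `score` bonus would be added to each beam hypothesis's log-probability at every expansion step, and `map_back` would run once on the final transcription.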
Related papers
- Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper [33.50962290311746]
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words. We develop a method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance.
arXiv Detail & Related papers (2026-02-26T06:17:56Z) - Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition [56.972851337263755]
We propose a method that allows correction of substitution errors to improve the recognition accuracy of challenging words. We show that this method yields a relative improvement in biased word error rate of up to 11%, while maintaining a competitive overall word error rate.
arXiv Detail & Related papers (2025-06-23T14:42:03Z) - Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation [4.314729314139958]
We propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair. We employ a decoding strategy that utilizes dictionary prior knowledge to correct errors in phonemic labeling. The subjective evaluation results indicate that the naturalness of speech synthesized by the TTS model, trained with labels annotated using our method, is comparable to that of a model trained with manual annotations.
arXiv Detail & Related papers (2025-06-09T11:10:24Z) - Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation [7.455706251115513]
We propose an encoder-based phrase-level contextualized ASR method that leverages dynamic vocabulary prediction and activation. Experiments on the Librispeech and Wenetspeech datasets demonstrate that our approach achieves relative WER reductions of 28.31% and 23.49% compared to the baseline.
arXiv Detail & Related papers (2025-05-29T04:31:33Z) - Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach for Automatic Speech Recognition. We use a memory-enhanced ASR model from the literature to decode new words from the slides. We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z) - Wiki-En-ASR-Adapt: Large-scale synthetic dataset for English ASR Customization [66.22007368434633]
We present the first large-scale public synthetic dataset for contextual spellchecking customization of automatic speech recognition (ASR).
The proposed approach allows creating millions of realistic examples of corrupted ASR hypotheses and simulating non-trivial biasing lists for the customization task.
We report experiments with training an open-source customization model on the proposed dataset and show that the injection of hard negative biasing phrases decreases WER and the number of false alarms.
arXiv Detail & Related papers (2023-09-29T14:18:59Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement across different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z) - Improving Contextual Recognition of Rare Words with an Alternate
Spelling Prediction Model [0.0]
We release contextual biasing lists to accompany the Earnings21 dataset.
We show results for shallow fusion contextual biasing applied to two different decoding algorithms.
We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative.
arXiv Detail & Related papers (2022-09-02T19:30:16Z) - Class LM and word mapping for contextual biasing in End-to-End ASR [4.989480853499918]
In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid interest in the speech recognition community.
In this paper, we propose an algorithm to train a context-aware E2E model and allow the beam search to traverse into the context FST during inference.
Although an E2E model does not need a pronunciation dictionary, it is interesting to make use of existing pronunciation knowledge to improve accuracy.
arXiv Detail & Related papers (2020-07-10T20:58:44Z) - Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.