Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer
- URL: http://arxiv.org/abs/2311.08966v1
- Date: Wed, 15 Nov 2023 13:53:28 GMT
- Title: Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer
- Authors: Jin Qiu, Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma
- Abstract summary: Deep biasing for the Transducer can improve the recognition performance of rare words or contextual entities.
In this paper, we combine the phoneme and textual information of rare words in Transducers to distinguish words with similar pronunciation or spelling.
Experiments on the LibriSpeech corpus demonstrate that the proposed method achieves state-of-the-art rare word error rates for bias lists of different scales and levels.
- Score: 23.70253642540094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep biasing for the Transducer can improve the recognition performance of
rare words or contextual entities, which is essential in practical
applications, especially for streaming Automatic Speech Recognition (ASR).
However, deep biasing with large-scale rare words remains challenging, as the
performance drops significantly when more distractors exist and there are words
with similar grapheme sequences in the bias list. In this paper, we combine the
phoneme and textual information of rare words in Transducers to distinguish
words with similar pronunciation or spelling. Moreover, introducing text-only
training data that contains more rare words benefits large-scale deep biasing.
Experiments on the LibriSpeech corpus demonstrate that the proposed method
achieves state-of-the-art rare word error rates for bias lists of different
scales and levels.
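The abstract does not give implementation details, so the following is a minimal PyTorch sketch of one plausible reading: each bias-list word is encoded from both its grapheme (spelling) and phoneme (pronunciation) sequences, and the two views are fused into a single vector that a biasing attention module could consume. All class names, vocabulary sizes, and dimensions here are illustrative assumptions, not the authors' architecture.
```python
import torch
import torch.nn as nn

class PhonemeGraphemeBiasEncoder(nn.Module):
    """Encodes each bias-list word from both its grapheme (spelling) and
    phoneme (pronunciation) sequences, then fuses the two views so that
    words that collide in one modality stay separable in the other."""

    def __init__(self, n_graphemes: int, n_phonemes: int, dim: int = 256):
        super().__init__()
        self.g_emb = nn.Embedding(n_graphemes, dim, padding_idx=0)
        self.p_emb = nn.Embedding(n_phonemes, dim, padding_idx=0)
        self.g_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.p_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, graphemes: torch.Tensor, phonemes: torch.Tensor) -> torch.Tensor:
        # graphemes: (num_words, max_g_len); phonemes: (num_words, max_p_len)
        _, (g_h, _) = self.g_rnn(self.g_emb(graphemes))  # final hidden state per word
        _, (p_h, _) = self.p_rnn(self.p_emb(phonemes))
        word_vecs = torch.cat([g_h[-1], p_h[-1]], dim=-1)  # (num_words, 2 * dim)
        return self.fuse(word_vecs)                        # (num_words, dim)

# One fused vector per bias word; a transducer's biasing attention would
# attend over these vectors to steer decoding toward the listed words.
enc = PhonemeGraphemeBiasEncoder(n_graphemes=30, n_phonemes=45)
graphemes = torch.randint(1, 30, (5, 12))  # 5 bias words, spelling token ids
phonemes = torch.randint(1, 45, (5, 10))   # matching pronunciation token ids
print(enc(graphemes, phonemes).shape)      # torch.Size([5, 256])
```
Fusing the two views is what would let the bias encoder keep apart entries that collide in only one modality, e.g. near-homophones with different spellings in a large bias list.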
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview [26.823126615724888]
The end-to-end (E2E) approach is gradually replacing hybrid models for automatic speech recognition (ASR) tasks.
We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions.
In our experiments, for subsets of rare words appearing in the training speech between 10 and 20 times, the proposed method achieves relative improvements of 9.3% and 5.1%, respectively.
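The abstract names a transform probability matrix but not its exact construction, so here is a toy numpy sketch of the general idea: per-step token posteriors from the decoder are re-weighted by a matrix estimated from training-transcription statistics, then renormalized. The matrix values below are invented for illustration.
```python
import numpy as np

def post_decoder_bias(posteriors: np.ndarray, transform: np.ndarray) -> np.ndarray:
    """Re-weight per-step token posteriors with a transform probability
    matrix (row: decoded token, column: candidate token), then renormalize."""
    biased = posteriors @ transform  # (T, V) @ (V, V) -> (T, V)
    return biased / biased.sum(axis=-1, keepdims=True)

# Toy vocabulary of 4 tokens: mostly the identity, with a little probability
# mass shifted toward token 3, standing in for a rare word whose statistics
# were estimated from the training transcriptions.
V = 4
transform = np.eye(V) * 0.9
transform[:, 3] += 0.1
posteriors = np.full((2, V), 0.25)  # two uniform decoding steps
print(post_decoder_bias(posteriors, transform))
```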
arXiv Detail & Related papers (2024-03-01T08:53:52Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
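As a rough illustration of replacing one-hot targets with corpus-derived text distributions, the sketch below smooths each hard label with a unigram prior estimated from a text corpus. The real method presumably uses a stronger language model; the unigram prior, function names, and mixing weight are assumptions made for this example.
```python
import numpy as np
from collections import Counter

def soft_targets(label_ids, corpus_ids, vocab_size, alpha=0.9):
    """Replace one-hot training targets with a mix of the true label and a
    unigram prior estimated from a large text corpus, so the model trains
    against linguistically plausible alternatives, not a single hard class."""
    counts = Counter(corpus_ids)
    prior = np.array([counts.get(i, 0) for i in range(vocab_size)], dtype=float)
    prior = (prior + 1.0) / (prior + 1.0).sum()  # add-one smoothing
    targets = np.zeros((len(label_ids), vocab_size))
    targets[np.arange(len(label_ids)), label_ids] = alpha
    return targets + (1.0 - alpha) * prior       # each row sums to 1

corpus = [0, 1, 1, 2, 2, 2, 3]  # toy corpus token ids
print(soft_targets([2, 0], corpus, vocab_size=4))
```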
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
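The redundancy being measured is, in information-theoretic terms, the mutual information between prosodic features and the text. The paper estimates it with LLM-based models; the sketch below computes the same quantity exactly on a toy discrete joint distribution to make the definition concrete.
```python
import numpy as np

def mutual_information(joint) -> float:
    """I(X;Y) in bits from a discrete joint distribution over (X, Y)."""
    joint = np.asarray(joint, dtype=float)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over rows
    py = joint.sum(axis=0, keepdims=True)   # marginal over columns
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Toy joint: prosodic prominence (rows) vs. a binary word feature (columns).
# Nonzero MI means the words already carry part of the prosodic information.
print(mutual_information([[0.4, 0.1],
                          [0.1, 0.4]]))  # ~0.278 bits of redundancy
```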
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
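The abstract does not describe how the bias representations are injected, so the following PyTorch sketch shows one common pattern consistent with it: audio encoder frames cross-attend over bias-phrase embeddings (which could come from the character-level and NLM encoders mentioned above), and the attended context is added back to the frames. Dimensions and module layout are illustrative assumptions.
```python
import torch
import torch.nn as nn

class BiasingCrossAttention(nn.Module):
    """Audio frames cross-attend over bias-phrase embeddings; the attended
    context is added back so that decoding is nudged toward listed phrases."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames: torch.Tensor, bias_vecs: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, dim); bias_vecs: (B, N, dim) from a phrase encoder
        ctx, _ = self.attn(frames, bias_vecs, bias_vecs)
        return frames + ctx  # residual injection of biasing context

layer = BiasingCrossAttention()
out = layer(torch.randn(2, 50, 256), torch.randn(2, 8, 256))
print(out.shape)  # torch.Size([2, 50, 256])
```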
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
- A Few Shot Multi-Representation Approach for N-gram Spotting in Historical Manuscripts [1.2930503923129213]
We propose a few-shot learning paradigm for spotting sequences of a few characters (n-grams).
We show that recognizing important n-grams can reduce the system's dependency on vocabulary.
arXiv Detail & Related papers (2022-09-21T15:35:02Z)
- Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model [0.0]
We release contextual biasing lists to accompany the Earnings21 dataset.
We show results for shallow fusion contextual biasing applied to two different decoding algorithms.
We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative.
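One straightforward way to use an alternate spelling prediction model is to expand the biasing list with the predicted variants and keep a map back to the canonical form for post-processing; the sketch below illustrates that plumbing with hypothetical predictions, not the paper's actual model output.
```python
def expand_bias_list(words, alt_spellings):
    """Expand a contextual biasing list with predicted alternate spellings,
    keeping a map from each variant back to its canonical display form so
    recognized variants can be restored after decoding."""
    variant_to_canonical = {}
    for word in words:
        variant_to_canonical[word] = word
        for alt in alt_spellings.get(word, []):
            variant_to_canonical[alt] = word
    return variant_to_canonical

# Hypothetical output of an alternate spelling prediction model.
alts = {"EBITDA": ["ebit da", "ebitda"], "Zoom": ["zoom"]}
print(expand_bias_list(["EBITDA", "Zoom"], alts))
```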
arXiv Detail & Related papers (2022-09-02T19:30:16Z)
- On Guiding Visual Attention with Language Specification [76.08326100891571]
We use high-level language specification as advice for constraining the classification evidence to task-relevant features, instead of distractors.
We show that supervising spatial attention in this way improves performance on classification tasks with biased and noisy data.
arXiv Detail & Related papers (2022-02-17T22:40:19Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed metric, neighboring distribution divergence (NDD), to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
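The summary leaves NDD's computation abstract. The sketch below shows the two ingredients one plausible implementation needs: finding the longest-common-subsequence positions shared by the paired texts (here via difflib), and comparing MLM-predicted token distributions at aligned positions with a symmetrized KL divergence. The MLM itself is left out; the distributions in the example are made up.
```python
import difflib
import numpy as np

def symmetrized_kl(dist_a, dist_b) -> float:
    """Symmetrized KL divergence between two predicted token distributions
    at aligned positions; a larger value means a larger semantic difference."""
    eps = 1e-9
    a, b = np.asarray(dist_a, float) + eps, np.asarray(dist_b, float) + eps
    kl = lambda p, q: float(np.sum(p * np.log(p / q)))
    return 0.5 * (kl(a, b) + kl(b, a))

def shared_words(text_a: str, text_b: str):
    """Words on the longest common subsequence of the two texts; an MLM
    would be queried at these positions in each text (model not shown)."""
    tokens_a, tokens_b = text_a.split(), text_b.split()
    sm = difflib.SequenceMatcher(a=tokens_a, b=tokens_b)
    return [w for block in sm.get_matching_blocks()
            for w in tokens_a[block.a:block.a + block.size]]

print(shared_words("the cat sat down", "the cat lay down"))  # ['the', 'cat', 'down']
print(symmetrized_kl([0.7, 0.3], [0.4, 0.6]))  # made-up MLM outputs
```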
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
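The core idea of mining silver labels from consistently co-occurring word sequences can be illustrated with simple n-gram counting, as in the sketch below; UCPhrase's actual mining is per-document and more refined, so treat this as a simplified stand-in.
```python
from collections import Counter

def mine_silver_phrases(documents, min_count=3, max_len=4):
    """Keep word n-grams that recur across a document collection as silver
    phrase labels; a simplified stand-in for UCPhrase's core-phrase mining."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(ngram) for ngram, c in counts.items() if c >= min_count}

docs = ["deep biasing for speech recognition",
        "deep biasing improves rare words",
        "we study deep biasing methods"]
print(mine_silver_phrases(docs))  # {'deep biasing'}
```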
arXiv Detail & Related papers (2021-05-28T19:44:24Z)