Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer
- URL: http://arxiv.org/abs/2311.08966v1
- Date: Wed, 15 Nov 2023 13:53:28 GMT
- Title: Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer
- Authors: Jin Qiu, Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma
- Abstract summary: Deep biasing for the Transducer can improve the recognition performance of rare words or contextual entities.
In this paper, we combine the phoneme and textual information of rare words in Transducers to distinguish words with similar pronunciation or spelling.
Experiments on the LibriSpeech corpus demonstrate that the proposed method achieves state-of-the-art rare word error rates for bias lists of different scales and levels.
- Score: 23.70253642540094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep biasing for the Transducer can improve the recognition performance of
rare words or contextual entities, which is essential in practical
applications, especially for streaming Automatic Speech Recognition (ASR).
However, deep biasing with large-scale rare words remains challenging, as the
performance drops significantly when more distractors exist and there are words
with similar grapheme sequences in the bias list. In this paper, we combine the
phoneme and textual information of rare words in Transducers to distinguish
words with similar pronunciation or spelling. Moreover, introducing text-only
training data that contains more rare words benefits large-scale deep biasing.
Experiments on the LibriSpeech corpus demonstrate that the proposed method
achieves state-of-the-art rare word error rates for bias lists of different
scales and levels.
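The abstract does not give implementation details, so the following is a minimal PyTorch sketch of one plausible reading: each bias-list word is encoded from both its grapheme (spelling) and phoneme (pronunciation) sequences, and the two views are fused into a single vector that a biasing attention module could consume. All class names, vocabulary sizes, and dimensions here are illustrative assumptions, not the authors' architecture.
```python
import torch
import torch.nn as nn

class PhonemeGraphemeBiasEncoder(nn.Module):
    """Encodes each bias-list word from both its grapheme (spelling) and
    phoneme (pronunciation) sequences, then fuses the two views so that
    words that collide in one modality stay separable in the other."""

    def __init__(self, n_graphemes: int, n_phonemes: int, dim: int = 256):
        super().__init__()
        self.g_emb = nn.Embedding(n_graphemes, dim, padding_idx=0)
        self.p_emb = nn.Embedding(n_phonemes, dim, padding_idx=0)
        self.g_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.p_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, graphemes: torch.Tensor, phonemes: torch.Tensor) -> torch.Tensor:
        # graphemes: (num_words, max_g_len); phonemes: (num_words, max_p_len)
        _, (g_h, _) = self.g_rnn(self.g_emb(graphemes))  # final hidden state per word
        _, (p_h, _) = self.p_rnn(self.p_emb(phonemes))
        word_vecs = torch.cat([g_h[-1], p_h[-1]], dim=-1)  # (num_words, 2 * dim)
        return self.fuse(word_vecs)                        # (num_words, dim)

# One fused vector per bias word; a transducer's biasing attention would
# attend over these vectors to steer decoding toward the listed words.
enc = PhonemeGraphemeBiasEncoder(n_graphemes=30, n_phonemes=45)
graphemes = torch.randint(1, 30, (5, 12))  # 5 bias words, spelling token ids
phonemes = torch.randint(1, 45, (5, 10))   # matching pronunciation token ids
print(enc(graphemes, phonemes).shape)      # torch.Size([5, 256])
```
Fusing the two views is what would let the bias encoder keep apart entries that collide in only one modality, e.g. near-homophones with different spellings in a large bias list.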
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview [26.823126615724888]
The end-to-end (E2E) approach is gradually replacing hybrid models for automatic speech recognition (ASR) tasks.
We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions.
In our experiments, for subsets of rare words appearing in the training speech between 10 and 20 times, the proposed method achieves relative improvements of 9.3% and 5.1%, respectively.
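The abstract names a transform probability matrix but not its exact construction, so here is a toy numpy sketch of the general idea: per-step token posteriors from the decoder are re-weighted by a matrix estimated from training-transcription statistics, then renormalized. The matrix values below are invented for illustration.
```python
import numpy as np

def post_decoder_bias(posteriors: np.ndarray, transform: np.ndarray) -> np.ndarray:
    """Re-weight per-step token posteriors with a transform probability
    matrix (row: decoded token, column: candidate token), then renormalize."""
    biased = posteriors @ transform  # (T, V) @ (V, V) -> (T, V)
    return biased / biased.sum(axis=-1, keepdims=True)

# Toy vocabulary of 4 tokens: mostly the identity, with a little probability
# mass shifted toward token 3, standing in for a rare word whose statistics
# were estimated from the training transcriptions.
V = 4
transform = np.eye(V) * 0.9
transform[:, 3] += 0.1
posteriors = np.full((2, V), 0.25)  # two uniform decoding steps
print(post_decoder_bias(posteriors, transform))
```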
arXiv Detail & Related papers (2024-03-01T08:53:52Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
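As a rough illustration of replacing one-hot targets with corpus-derived text distributions, the sketch below smooths each hard label with a unigram prior estimated from a text corpus. The real method presumably uses a stronger language model; the unigram prior, function names, and mixing weight are assumptions made for this example.
```python
import numpy as np
from collections import Counter

def soft_targets(label_ids, corpus_ids, vocab_size, alpha=0.9):
    """Replace one-hot training targets with a mix of the true label and a
    unigram prior estimated from a large text corpus, so the model trains
    against linguistically plausible alternatives, not a single hard class."""
    counts = Counter(corpus_ids)
    prior = np.array([counts.get(i, 0) for i in range(vocab_size)], dtype=float)
    prior = (prior + 1.0) / (prior + 1.0).sum()  # add-one smoothing
    targets = np.zeros((len(label_ids), vocab_size))
    targets[np.arange(len(label_ids)), label_ids] = alpha
    return targets + (1.0 - alpha) * prior       # each row sums to 1

corpus = [0, 1, 1, 2, 2, 2, 3]  # toy corpus token ids
print(soft_targets([2, 0], corpus, vocab_size=4))
```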
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
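The redundancy being measured is, in information-theoretic terms, the mutual information between prosodic features and the text. The paper estimates it with LLM-based models; the sketch below computes the same quantity exactly on a toy discrete joint distribution to make the definition concrete.
```python
import numpy as np

def mutual_information(joint) -> float:
    """I(X;Y) in bits from a discrete joint distribution over (X, Y)."""
    joint = np.asarray(joint, dtype=float)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over rows
    py = joint.sum(axis=0, keepdims=True)   # marginal over columns
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Toy joint: prosodic prominence (rows) vs. a binary word feature (columns).
# Nonzero MI means the words already carry part of the prosodic information.
print(mutual_information([[0.4, 0.1],
                          [0.1, 0.4]]))  # ~0.278 bits of redundancy
```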
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
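The abstract does not describe how the bias representations are injected, so the following PyTorch sketch shows one common pattern consistent with it: audio encoder frames cross-attend over bias-phrase embeddings (which could come from the character-level and NLM encoders mentioned above), and the attended context is added back to the frames. Dimensions and module layout are illustrative assumptions.
```python
import torch
import torch.nn as nn

class BiasingCrossAttention(nn.Module):
    """Audio frames cross-attend over bias-phrase embeddings; the attended
    context is added back so that decoding is nudged toward listed phrases."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames: torch.Tensor, bias_vecs: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, dim); bias_vecs: (B, N, dim) from a phrase encoder
        ctx, _ = self.attn(frames, bias_vecs, bias_vecs)
        return frames + ctx  # residual injection of biasing context

layer = BiasingCrossAttention()
out = layer(torch.randn(2, 50, 256), torch.randn(2, 8, 256))
print(out.shape)  # torch.Size([2, 50, 256])
```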
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
- A Few Shot Multi-Representation Approach for N-gram Spotting in Historical Manuscripts [1.2930503923129213]
We propose a few-shot learning paradigm for spotting sequences of a few characters (n-grams).
We show that recognizing important n-grams can reduce the system's dependency on vocabulary.
arXiv Detail & Related papers (2022-09-21T15:35:02Z)
- Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model [0.0]
We release contextual biasing lists to accompany the Earnings21 dataset.
We show results for shallow fusion contextual biasing applied to two different decoding algorithms.
We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative.
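One straightforward way to use an alternate spelling prediction model is to expand the biasing list with the predicted variants and keep a map back to the canonical form for post-processing; the sketch below illustrates that plumbing with hypothetical predictions, not the paper's actual model output.
```python
def expand_bias_list(words, alt_spellings):
    """Expand a contextual biasing list with predicted alternate spellings,
    keeping a map from each variant back to its canonical display form so
    recognized variants can be restored after decoding."""
    variant_to_canonical = {}
    for word in words:
        variant_to_canonical[word] = word
        for alt in alt_spellings.get(word, []):
            variant_to_canonical[alt] = word
    return variant_to_canonical

# Hypothetical output of an alternate spelling prediction model.
alts = {"EBITDA": ["ebit da", "ebitda"], "Zoom": ["zoom"]}
print(expand_bias_list(["EBITDA", "Zoom"], alts))
```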
arXiv Detail & Related papers (2022-09-02T19:30:16Z)
- On Guiding Visual Attention with Language Specification [76.08326100891571]
We use high-level language specification as advice for constraining the classification evidence to task-relevant features, instead of distractors.
We show that supervising spatial attention in this way improves performance on classification tasks with biased and noisy data.
arXiv Detail & Related papers (2022-02-17T22:40:19Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed metric, neighboring distribution divergence (NDD), to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
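The summary leaves NDD's computation abstract. The sketch below shows the two ingredients one plausible implementation needs: finding the longest-common-subsequence positions shared by the paired texts (here via difflib), and comparing MLM-predicted token distributions at aligned positions with a symmetrized KL divergence. The MLM itself is left out; the distributions in the example are made up.
```python
import difflib
import numpy as np

def symmetrized_kl(dist_a, dist_b) -> float:
    """Symmetrized KL divergence between two predicted token distributions
    at aligned positions; a larger value means a larger semantic difference."""
    eps = 1e-9
    a, b = np.asarray(dist_a, float) + eps, np.asarray(dist_b, float) + eps
    kl = lambda p, q: float(np.sum(p * np.log(p / q)))
    return 0.5 * (kl(a, b) + kl(b, a))

def shared_words(text_a: str, text_b: str):
    """Words on the longest common subsequence of the two texts; an MLM
    would be queried at these positions in each text (model not shown)."""
    tokens_a, tokens_b = text_a.split(), text_b.split()
    sm = difflib.SequenceMatcher(a=tokens_a, b=tokens_b)
    return [w for block in sm.get_matching_blocks()
            for w in tokens_a[block.a:block.a + block.size]]

print(shared_words("the cat sat down", "the cat lay down"))  # ['the', 'cat', 'down']
print(symmetrized_kl([0.7, 0.3], [0.4, 0.6]))  # made-up MLM outputs
```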
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
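The core idea of mining silver labels from consistently co-occurring word sequences can be illustrated with simple n-gram counting, as in the sketch below; UCPhrase's actual mining is per-document and more refined, so treat this as a simplified stand-in.
```python
from collections import Counter

def mine_silver_phrases(documents, min_count=3, max_len=4):
    """Keep word n-grams that recur across a document collection as silver
    phrase labels; a simplified stand-in for UCPhrase's core-phrase mining."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(ngram) for ngram, c in counts.items() if c >= min_count}

docs = ["deep biasing for speech recognition",
        "deep biasing improves rare words",
        "we study deep biasing methods"]
print(mine_silver_phrases(docs))  # {'deep biasing'}
```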
arXiv Detail & Related papers (2021-05-28T19:44:24Z)