Adaptive Contextual Biasing for Transducer Based Streaming Speech
Recognition
- URL: http://arxiv.org/abs/2306.00804v3
- Date: Tue, 15 Aug 2023 04:36:14 GMT
- Title: Adaptive Contextual Biasing for Transducer Based Streaming Speech
Recognition
- Authors: Tianyi Xu, Zhanheng Yang, Kaixun Huang, Pengcheng Guo, Ao Zhang, Biao
Li, Changru Chen, Chao Li, Lei Xie
- Abstract summary: Deep biasing methods have emerged as a promising solution for speech recognition of personalized words.
For real-world voice assistants, however, always biasing on such words with high prediction scores can significantly degrade the recognition of common words.
We propose an adaptive contextual biasing method based on the Context-Aware Transformer Transducer (CATT) that uses the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences.
- Score: 21.90433428015086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: By incorporating additional contextual information, deep biasing methods have
emerged as a promising solution for speech recognition of personalized words.
However, for real-world voice assistants, always biasing on such personalized
words with high prediction scores can significantly degrade the performance of
recognizing common words. To address this issue, we propose an adaptive
contextual biasing method based on Context-Aware Transformer Transducer (CATT)
that utilizes the biased encoder and predictor embeddings to perform streaming
prediction of contextual phrase occurrences. Such prediction is then used to
dynamically switch the bias list on and off, enabling the model to adapt to
both personalized and common scenarios. Experiments on Librispeech and internal
voice assistant datasets show that our approach can achieve up to 6.7% and
20.7% relative reduction in WER and CER compared to the baseline respectively,
mitigating up to 96.7% and 84.9% of the relative WER and CER increase for
common cases. Furthermore, our approach has a minimal performance impact in
personalized scenarios while maintaining a streaming inference pipeline with
negligible RTF increase.
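The switching step described in the abstract can be illustrated with a minimal sketch. It assumes a per-step probability that a contextual phrase is occurring is already available; in the paper that prediction comes from the biased encoder and predictor embeddings, which are not modeled here. The names `BiasGate` and `apply_bias`, the threshold, and the smoothing weight are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BiasGate:
    """Hypothetical streaming on/off switch for a contextual bias list."""

    threshold: float = 0.5  # assumed decision threshold, not taken from the paper
    smoothing: float = 0.0  # assumed weight on the previous smoothed score

    def run(self, occurrence_probs: Sequence[float]) -> List[bool]:
        """Return one decision per step, using only current and past scores."""
        decisions: List[bool] = []
        smoothed = 0.0
        for p in occurrence_probs:
            # Exponential smoothing keeps the gate causal (no lookahead).
            smoothed = self.smoothing * smoothed + (1.0 - self.smoothing) * p
            decisions.append(smoothed >= self.threshold)
        return decisions


def apply_bias(features: Sequence[float], bias: Sequence[float], enabled: bool) -> List[float]:
    """Add the bias vector to the features only when the gate is on."""
    if not enabled:
        return list(features)
    return [f + b for f, b in zip(features, bias)]


if __name__ == "__main__":
    gate = BiasGate(threshold=0.5)
    for step, on in enumerate(gate.run([0.1, 0.9, 0.4])):
        print(step, "bias on" if on else "bias off")
```

When the gate is off, the features pass through unchanged, which mirrors the stated goal of avoiding degradation on common words; the actual mechanism by which the paper disables the bias list may differ.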
Related papers
- Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss [44.94458898538114]
Using explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives.
Our proposed intermediate biasing loss brings more regularization and contextualization to the network.
arXiv Detail & Related papers (2024-06-23T14:22:59Z)
- Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search [44.94458898538114]
This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list.
The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data.
arXiv Detail & Related papers (2024-01-19T01:36:07Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network [14.115294331065318]
We introduce a contextual phrase prediction network for an attention-based deep bias method.
This network predicts context phrases in utterances using contextual embeddings and calculates bias loss to assist in the training of the contextualized model.
Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models.
arXiv Detail & Related papers (2023-05-21T16:08:04Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods for Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- End-to-end contextual asr based on posterior distribution adaptation for hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a drawback for contextual ASR: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures [62.562760228942054]
Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective.
We propose to augment the input sentences in the training data with their corresponding predicate-argument structures.
We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases.
arXiv Detail & Related papers (2020-10-23T16:22:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.