Text Injection for Neural Contextual Biasing
- URL: http://arxiv.org/abs/2406.02921v2
- Date: Tue, 11 Jun 2024 04:11:56 GMT
- Title: Text Injection for Neural Contextual Biasing
- Authors: Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran,
- Abstract summary: This work proposes contextual text injection (CTI) to enhance contextual ASR.
CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model.
- Score: 57.589903308622745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and its biasing component. Unpaired text is converted into speech-like representations and used to guide the model's attention towards relevant bias phrases. Moreover, we introduce a contextual text-injected (CTI) minimum word error rate (MWER) training, which minimizes the expected WER caused by contextual biasing when unpaired text is injected into the model. Experiments show that CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model. CTI-MWER provides a further relative improvement of 23.5%.
Related papers
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relatively compared to no biasing and shallow fusion.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z) - Contextualized End-to-End Speech Recognition with Contextual Phrase
Prediction Network [14.115294331065318]
We introduce a contextual phrase prediction network for an attention-based deep bias method.
This network predicts context phrases in utterances using contextual embeddings and calculates bias loss to assist in the training of the contextualized model.
Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models.
arXiv Detail & Related papers (2023-05-21T16:08:04Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Speech-text based multi-modal training with bidirectional attention for
improved speech recognition [26.47071418582507]
We propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn both ASR encoder (bottom layers) and text encoder with a multi-modal learning method.
BiAM is to facilitate feature sampling rate exchange, realizing the quality of the transformed features for the one kind to be measured in another space.
Experimental results on Librispeech corpus show it can achieve up to 6.15% word error rate reduction (WERR) with only paired data learning, while 9.23% WERR when more unpaired text data is employed.
arXiv Detail & Related papers (2022-11-01T08:25:11Z) - Automatic Prosody Annotation with Pre-Trained Text-Speech Model [48.47706377700962]
We propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders.
This model is pre-trained on text and speech data separately and jointly fine-tuned on TTS data in a triplet format: speech, text, prosody
arXiv Detail & Related papers (2022-06-16T06:54:16Z) - Minimising Biasing Word Errors for Contextual ASR with the
Tree-Constrained Pointer Generator [19.372248692745167]
Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words.
This paper proposes a novel tree-constrained pointer generator ( TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words.
arXiv Detail & Related papers (2022-05-18T16:40:50Z) - Distribution augmentation for low-resource expressive text-to-speech [18.553812159109253]
This paper presents a novel data augmentation technique for text-to-speech (TTS)
It allows to generate new (text, audio) training examples without requiring any additional data.
arXiv Detail & Related papers (2022-02-13T21:19:31Z) - A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and
Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR)
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.