Contextualized End-to-End Speech Recognition with Contextual Phrase
Prediction Network
- URL: http://arxiv.org/abs/2305.12493v5
- Date: Wed, 12 Jul 2023 17:41:53 GMT
- Title: Contextualized End-to-End Speech Recognition with Contextual Phrase
Prediction Network
- Authors: Kaixun Huang, Ao Zhang, Zhanheng Yang, Pengcheng Guo, Bingshen Mu,
Tianyi Xu, Lei Xie
- Abstract summary: We introduce a contextual phrase prediction network for an attention-based deep bias method.
This network predicts the context phrases in an utterance using contextual embeddings and calculates a bias loss to assist the training of the contextualized model.
Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models.
- Score: 14.115294331065318
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contextual information plays a crucial role in speech recognition
technologies, and incorporating it into end-to-end speech recognition models
has drawn immense interest recently. However, previous deep bias methods lacked
explicit supervision for bias tasks. In this study, we introduce a contextual
phrase prediction network for an attention-based deep bias method. This network
predicts the context phrases in an utterance using contextual embeddings and
calculates a bias loss to assist in training the contextualized model. Our
method achieved a significant word error rate (WER) reduction across various
end-to-end speech recognition models. Experiments on the LibriSpeech corpus
show that our proposed model obtains a 12.1% relative WER improvement over the
baseline model, and the WER on context phrases drops by a relative 40.5%.
Moreover, by applying a context phrase filtering strategy, we also
effectively eliminate the WER degradation when using a larger biasing list.
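To make the abstract's mechanism concrete, the following is a minimal PyTorch sketch of attention-based deep biasing with an auxiliary phrase-prediction head. The module names (ContextualBiasingLayer, phrase_head) and the frame-level cross-entropy bias loss are our illustrative assumptions; the paper's exact architecture and loss formulation may differ.

```python
import torch
import torch.nn as nn

class ContextualBiasingLayer(nn.Module):
    """Cross-attention deep biasing with an auxiliary phrase-prediction head."""

    def __init__(self, dim: int, num_phrases: int, num_heads: int = 4):
        super().__init__()
        # Acoustic frames (queries) attend over bias-phrase embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Predict, per frame, which context phrase (if any) is being spoken;
        # class 0 is reserved for "no bias phrase".
        self.phrase_head = nn.Linear(dim, num_phrases + 1)

    def forward(self, enc_out, phrase_emb):
        # enc_out:    (B, T, dim) encoder hidden states
        # phrase_emb: (B, N, dim) contextual embeddings of the biasing list
        biased, _ = self.cross_attn(enc_out, phrase_emb, phrase_emb)
        fused = enc_out + biased                 # residual fusion
        return fused, self.phrase_head(fused)    # states, (B, T, N + 1) logits

B, T, N, dim = 2, 50, 10, 256
layer = ContextualBiasingLayer(dim, num_phrases=N)
enc_out, phrase_emb = torch.randn(B, T, dim), torch.randn(B, N, dim)
labels = torch.randint(0, N + 1, (B, T))         # 0 = no phrase at this frame

fused, logits = layer(enc_out, phrase_emb)
bias_loss = nn.CrossEntropyLoss()(logits.reshape(-1, N + 1), labels.reshape(-1))
# total_loss = asr_loss + lambda_bias * bias_loss  (lambda_bias: tunable weight)
```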
Related papers
- Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss [44.94458898538114]
Using an explicit biasing loss as an auxiliary task in intermediate encoder layers can better align text tokens or audio frames with the desired objectives (a toy sketch follows this entry).
Our proposed intermediate biasing loss brings more regularization and contextualization to the network.
arXiv Detail & Related papers (2024-06-23T14:22:59Z)
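For the intermediate-biasing-loss entry above, here is a hedged sketch of how an auxiliary bias loss might be tapped from intermediate encoder layers, in the spirit of intermediate-CTC regularization. The tapped layer indices, the shared head, and the loss weighting are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

dim, num_phrases = 256, 10
encoder_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(6)]
)
inter_head = nn.Linear(dim, num_phrases + 1)  # shared head; class 0 = "no phrase"
criterion = nn.CrossEntropyLoss()

x = torch.randn(2, 50, dim)                           # (B, T, dim) frame features
labels = torch.randint(0, num_phrases + 1, (2, 50))   # per-frame phrase labels

inter_losses = []
for i, layer in enumerate(encoder_layers):
    x = layer(x)
    if i in (1, 3):  # tap selected intermediate layers (the choice is ours)
        logits = inter_head(x)                        # (B, T, num_phrases + 1)
        inter_losses.append(
            criterion(logits.reshape(-1, num_phrases + 1), labels.reshape(-1))
        )
inter_bias_loss = torch.stack(inter_losses).mean()
# total_loss = asr_loss + final_bias_loss + mu * inter_bias_loss  (mu: tunable)
```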
- Text Injection for Neural Contextual Biasing [57.589903308622745]
This work proposes contextual text injection (CTI) to enhance contextual ASR.
CTI with 100 billion text sentences achieves a relative WER reduction of up to 43.3% over a strong neural biasing model.
arXiv Detail & Related papers (2024-06-05T04:20:17Z)
- Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search [44.94458898538114]
This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list.
The proposed method can be trained effectively by combining a bias phrase index loss with special tokens that detect the bias phrases in the input speech (see the sketch after this entry).
arXiv Detail & Related papers (2024-01-19T01:36:07Z)
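As a toy illustration of the boosted-beam-search idea in the entry above (not the paper's attention-based scorer), a hypothesis that currently extends a prefix of a bias phrase can receive a score bonus proportional to the match length. The function name and bonus scheme below are our own.

```python
# Toy shallow-boosting rule for beam search: reward hypotheses whose suffix
# matches a prefix of some bias phrase. This only illustrates boosting; the
# paper scores bias phrases with attention, not string matching.
def bias_bonus(hyp_tokens, bias_phrases, bonus=1.5):
    """Return a bonus proportional to the longest bias-phrase prefix the
    hypothesis currently ends with (0.0 if none)."""
    best = 0.0
    for phrase in bias_phrases:
        for k in range(min(len(phrase), len(hyp_tokens)), 0, -1):
            if hyp_tokens[-k:] == phrase[:k]:
                best = max(best, bonus * k)
                break                  # longest match for this phrase found
    return best

bias_phrases = [["san", "fran", "cis", "co"], ["lei", "xie"]]
hyp = ["we", "flew", "to", "san", "fran"]
print(bias_bonus(hyp, bias_phrases))   # 3.0: matches a 2-token prefix
# In decoding: beam_score = am_score + lm_score + bias_bonus(hyp, bias_phrases)
```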
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Improving End-to-End Contextual Speech Recognition with Fine-grained Contextual Knowledge Selection [21.116123328330467]
This work focuses on mitigating confusion problems with fine-grained contextual knowledge selection (FineCoS); a toy sketch follows this entry.
We first apply phrase selection to narrow the range of phrase candidates, and then conduct token attention on the tokens in the selected phrase candidates.
We re-normalize the attention weights of the most relevant phrases at inference time to obtain more focused phrase-level contextual representations.
arXiv Detail & Related papers (2022-01-30T13:08:16Z)
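A hedged sketch of the two-stage idea behind FineCoS as we read the entry above: score whole phrases first, keep the top-k, then run token-level attention only over the retained phrases. Shapes, names, and the dot-product scoring are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

B, T, N, L, dim, k = 1, 20, 8, 5, 64, 3
query = torch.randn(B, T, dim)            # acoustic queries
phrase_emb = torch.randn(B, N, dim)       # one vector per candidate phrase
token_emb = torch.randn(B, N, L, dim)     # token-level embeddings per phrase

# Stage 1: phrase selection - keep the k most relevant phrases per utterance.
phrase_scores = torch.einsum("btd,bnd->btn", query, phrase_emb)  # (B, T, N)
utt_scores = phrase_scores.mean(dim=1)                           # (B, N)
topk = utt_scores.topk(k, dim=-1).indices                        # (B, k)

# Stage 2: token-level attention restricted to the selected phrases.
sel_tokens = torch.stack([token_emb[b, topk[b]] for b in range(B)])  # (B, k, L, dim)
sel_tokens = sel_tokens.reshape(B, k * L, dim)
attn = F.softmax(torch.einsum("btd,bmd->btm", query, sel_tokens) / dim ** 0.5, -1)
context = attn @ sel_tokens               # (B, T, dim) focused context vectors
```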
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set encoded labels (sketched below).
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
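Power-set encoding, which SEND predicts, maps each frame's set of active speakers to a single class, turning overlap-aware diarization into one multi-class decision per frame. A minimal sketch (the function name is ours):

```python
def encode_powerset(active_speakers, num_speakers):
    """Map a set of active speaker indices to one of 2**num_speakers classes."""
    assert all(0 <= s < num_speakers for s in active_speakers)
    label = 0
    for s in active_speakers:
        label |= 1 << s        # one bit per speaker
    return label

print(encode_powerset({0, 2}, num_speakers=3))  # 5 -> speakers 0 and 2 overlap
print(encode_powerset(set(), num_speakers=3))   # 0 -> silence
```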
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- An Effective Contextual Language Modeling Framework for Speech Summarization with Augmented Features [13.97006782398121]
The Bidirectional Encoder Representations from Transformers (BERT) model has achieved record-breaking success on many natural language processing tasks.
We explore the incorporation of confidence scores into sentence representations to see if such an attempt could help alleviate the negative effects caused by imperfect automatic speech recognition.
We validate the effectiveness of our proposed method on a benchmark dataset.
arXiv Detail & Related papers (2020-06-01T18:27:48Z)
- Fast and Robust Unsupervised Contextual Biasing for Speech Recognition [16.557586847398778]
We propose an alternative approach that does not entail an explicit contextual language model.
Instead, we derive the bias score for every word in the system vocabulary from the training corpus (a toy sketch follows this entry).
We show significant improvement in recognition accuracy when the relevant context is available.
arXiv Detail & Related papers (2020-05-04T17:29:59Z)
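A hedged sketch of one way such corpus-derived bias scores could look: a smoothed log-ratio of a word's frequency in context-relevant text versus general training text. The scoring function, smoothing, and names are our assumptions, not the paper's exact recipe.

```python
from collections import Counter
import math

def bias_scores(context_corpus, general_corpus, alpha=1.0):
    """Per-word bias score from corpus statistics (illustrative log-ratio)."""
    ctx, gen = Counter(context_corpus), Counter(general_corpus)
    n_ctx, n_gen = sum(ctx.values()), sum(gen.values())
    vocab = set(ctx) | set(gen)
    v = len(vocab)
    scores = {}
    for w in vocab:
        p_ctx = (ctx[w] + alpha) / (n_ctx + alpha * v)  # add-alpha smoothing
        p_gen = (gen[w] + alpha) / (n_gen + alpha * v)
        scores[w] = math.log(p_ctx / p_gen)             # > 0 favors context words
    return scores

scores = bias_scores(["play", "navigate", "home"], ["the", "a", "play", "of"])
# During decoding, a scaled scores[w] could be added to the hypothesis score
# whenever word w is emitted and the relevant context is active.
```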