Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss
- URL: http://arxiv.org/abs/2406.16120v1
- Date: Sun, 23 Jun 2024 14:22:59 GMT
- Title: Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss
- Authors: Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe
- Abstract summary: Using explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives.
Our proposed intermediate biasing loss brings more regularization and contextualization to the network.
- Score: 44.94458898538114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives. Our proposed intermediate biasing loss brings more regularization and contextualization to the network. Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100. Moreover, employing RNN-transducer-driven joint decoding further reduces the unbiased word error rate (U-WER), resulting in a more robust network.
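The abstract describes combining the final ASR objective with explicit biasing losses computed at intermediate encoder layers. The sketch below illustrates that multitask weighting in a minimal, framework-agnostic way; the function name, the averaging over tapped layers, and the `aux_weight` value are assumptions for illustration, not the authors' exact formulation.

```python
# Hypothetical sketch of an intermediate-biasing multitask objective.
# `aux_weight` and the averaging over tapped layers are illustrative choices,
# not the paper's exact implementation.

def multitask_objective(final_loss, intermediate_bias_losses, aux_weight=0.3):
    """Combine the final ASR loss with auxiliary biasing losses taken
    from intermediate encoder layers, as a weighted sum.

    final_loss: scalar loss from the final objective (e.g. CTC/attention).
    intermediate_bias_losses: scalar biasing losses, one per tapped layer.
    aux_weight: weight given to the averaged auxiliary term.
    """
    if not intermediate_bias_losses:
        # No intermediate taps: fall back to the conventional objective.
        return final_loss
    aux = sum(intermediate_bias_losses) / len(intermediate_bias_losses)
    return (1.0 - aux_weight) * final_loss + aux_weight * aux
```

In a real system each intermediate loss would be produced by a biasing head attached to an encoder layer's output; the auxiliary term then acts as both a regularizer and a source of contextualization, as the abstract claims.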
Related papers
- Improving ASR Contextual Biasing with Guided Attention [47.74990801299927]
A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases.
We propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters.
arXiv Detail & Related papers (2024-01-16T21:16:12Z)
- Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network [14.115294331065318]
We introduce a contextual phrase prediction network for an attention-based deep bias method.
This network predicts context phrases in utterances using contextual embeddings and calculates bias loss to assist in the training of the contextualized model.
Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models.
arXiv Detail & Related papers (2023-05-21T16:08:04Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
- Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator [19.372248692745167]
Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words.
This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words.
arXiv Detail & Related papers (2022-05-18T16:40:50Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Improving End-to-End Contextual Speech Recognition with Fine-grained Contextual Knowledge Selection [21.116123328330467]
This work focuses on mitigating confusion problems with fine-grained contextual knowledge selection (FineCoS).
We first apply phrase selection to narrow the range of phrase candidates, and then conduct token attention on the tokens in the selected phrase candidates.
We re-normalize the attention weights of most relevant phrases in inference to obtain more focused phrase-level contextual representations.
arXiv Detail & Related papers (2022-01-30T13:08:16Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method can get less than 50ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.