Improving ASR Contextual Biasing with Guided Attention
- URL: http://arxiv.org/abs/2401.08835v1
- Date: Tue, 16 Jan 2024 21:16:12 GMT
- Title: Improving ASR Contextual Biasing with Guided Attention
- Authors: Jiyang Tang, Kwangyoun Kim, Suwon Shon, Felix Wu, Prashant Sridhar,
Shinji Watanabe
- Abstract summary: A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases.
We propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters.
- Score: 47.74990801299927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a Guided Attention (GA) auxiliary training loss,
which improves the effectiveness and robustness of automatic speech recognition
(ASR) contextual biasing without introducing additional parameters. A common
challenge in previous literature is that the word error rate (WER) reduction
brought by contextual biasing diminishes as the number of bias phrases
increases. To address this challenge, we employ a GA loss as an additional
training objective besides the Transducer loss. The proposed GA loss aims to
teach the cross attention how to align bias phrases with text tokens or audio
frames. Compared to studies with similar motivations, the proposed loss
operates directly on the cross attention weights and is easier to implement.
Through extensive experiments based on Conformer Transducer with Contextual
Adapter, we demonstrate that the proposed method not only leads to a lower WER
but also retains its effectiveness as the number of bias phrases increases.
Specifically, the GA loss decreases the WER of rare vocabularies by up to 19.2%
on LibriSpeech compared to the contextual biasing baseline, and up to 49.3%
compared to a vanilla Transducer.
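The abstract describes the GA loss as supervision applied directly to the cross-attention weights so that audio frames (or text tokens) attend to the correct bias phrase. Below is a minimal sketch of how such an auxiliary loss might look in PyTorch; the tensor shapes, the reserved "no-bias" column, the frame-level target alignment, and the loss weight are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a guided-attention style auxiliary loss, assuming we
# already have (a) cross-attention weights from a contextual adapter, shaped
# (batch, frames, num_bias_phrases + 1) with column 0 reserved for "no bias
# phrase", and (b) a frame-level target alignment saying which phrase (or 0)
# each frame should attend to.  All names and shapes are assumptions.
import torch
import torch.nn.functional as F


def guided_attention_loss(attn_weights: torch.Tensor,
                          target_phrase_ids: torch.Tensor,
                          frame_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between cross-attention weights and a target alignment.

    attn_weights:      (B, T, P+1) attention distribution over bias phrases.
    target_phrase_ids: (B, T) index of the phrase each frame should attend to
                       (0 where no bias phrase is spoken).
    frame_mask:        (B, T) 1.0 for real frames, 0.0 for padding.
    """
    log_probs = torch.log(attn_weights.clamp_min(1e-8))           # (B, T, P+1)
    nll = F.nll_loss(log_probs.transpose(1, 2),                   # (B, P+1, T)
                     target_phrase_ids, reduction="none")         # (B, T)
    return (nll * frame_mask).sum() / frame_mask.sum().clamp_min(1.0)


# Example: the auxiliary loss would be added to the Transducer loss.
if __name__ == "__main__":
    B, T, P = 2, 50, 10
    attn = torch.softmax(torch.randn(B, T, P + 1), dim=-1)
    targets = torch.randint(0, P + 1, (B, T))
    mask = torch.ones(B, T)
    ga_loss = guided_attention_loss(attn, targets, mask)
    # total_loss = transducer_loss + 0.5 * ga_loss   # weight is illustrative
    print(float(ga_loss))
```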
Related papers
- Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss [44.94458898538114]
Using an explicit biasing loss as an auxiliary task in intermediate encoder layers may better align text tokens or audio frames with the desired objectives.
Our proposed intermediate biasing loss brings more regularization and contextualization to the network.
arXiv Detail & Related papers (2024-06-23T14:22:59Z)
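As a rough illustration of the intermediate biasing idea summarized above, the sketch below attaches an auxiliary bias-classification head to one intermediate encoder layer. The per-frame bias targets, the chosen layer, and the loss form are assumptions for illustration, not the paper's actual recipe.

```python
# A rough sketch of an auxiliary biasing loss attached to an intermediate
# encoder layer.  The per-frame bias-tag targets, the choice of layer, and
# the loss weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderWithIntermediateBias(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 6, num_bias: int = 11):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        # Auxiliary head attached halfway through the encoder.
        self.mid = num_layers // 2
        self.bias_head = nn.Linear(dim, num_bias)  # class 0 = "no bias phrase"

    def forward(self, x, bias_targets):
        inter_loss = x.new_zeros(())
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == self.mid:
                logits = self.bias_head(x)                   # (B, T, num_bias)
                inter_loss = F.cross_entropy(
                    logits.transpose(1, 2), bias_targets)    # frame-level CE
        return x, inter_loss


# Usage: total loss = main ASR loss + lambda * intermediate biasing loss.
enc = EncoderWithIntermediateBias()
feats = torch.randn(2, 40, 256)
targets = torch.randint(0, 11, (2, 40))
_, aux = enc(feats, targets)
```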
- Text Injection for Neural Contextual Biasing [57.589903308622745]
This work proposes contextual text injection (CTI) to enhance contextual ASR.
CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model.
arXiv Detail & Related papers (2024-06-05T04:20:17Z)
- Incorporating granularity bias as the margin into contrastive loss for video captioning [0.0]
The long-tail distribution of phrases makes captioning models prone to generating vague sentences instead of accurate ones.
We introduce a statistics-based bias extractor to estimate the likelihood that a video-sentence pair is affected by granularity bias.
We then incorporate the margin score into the contrastive learning loss, establishing training objectives for head and tail sentences.
arXiv Detail & Related papers (2023-11-25T09:38:24Z)
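The sketch below illustrates one way a per-pair bias margin could be folded into a video-text contrastive loss, as the summary above describes. The InfoNCE-style formulation, where the margin enters, and how the bias score is obtained are assumptions made for illustration only.

```python
# A hedged sketch of folding a per-pair "granularity bias" margin into a
# video-text contrastive (InfoNCE-style) loss.
import torch
import torch.nn.functional as F


def margin_contrastive_loss(video_emb, text_emb, margins, temperature=0.07):
    """video_emb, text_emb: (B, D) L2-normalised embeddings of paired samples.
    margins: (B,) non-negative bias scores; a larger margin pushes the model
    to separate a biased positive pair more strongly from negatives."""
    sim = video_emb @ text_emb.t() / temperature                  # (B, B)
    # Subtract the margin only from the diagonal (the matched pairs).
    sim = sim - torch.diag(margins / temperature)
    labels = torch.arange(sim.size(0), device=sim.device)
    # Symmetric video-to-text and text-to-video cross entropy.
    return 0.5 * (F.cross_entropy(sim, labels) +
                  F.cross_entropy(sim.t(), labels))


v = F.normalize(torch.randn(8, 512), dim=-1)
t = F.normalize(torch.randn(8, 512), dim=-1)
m = torch.rand(8) * 0.2          # illustrative bias-derived margins
loss = margin_contrastive_loss(v, t, m)
```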
- Can Contextual Biasing Remain Effective with Whisper and GPT-2? [18.783162616664363]
This paper investigates the effectiveness of neural contextual biasing for Whisper combined with GPT-2.
Experiments across three datasets show a considerable reduction in errors on biasing words with a biasing list of 1000 words.
arXiv Detail & Related papers (2023-06-02T22:56:01Z)
- Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network [14.115294331065318]
We introduce a contextual phrase prediction network for an attention-based deep biasing method.
This network predicts context phrases in utterances using contextual embeddings and calculates bias loss to assist in the training of the contextualized model.
Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models.
arXiv Detail & Related papers (2023-05-21T16:08:04Z)
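The following sketch illustrates a phrase prediction head of the kind summarized above: it scores each candidate bias phrase against the encoder output and trains a "does this phrase occur in the utterance" objective as the bias loss. The pooling and multi-label formulation are assumptions, not the paper's exact architecture.

```python
# A loose sketch of a contextual phrase prediction head: given encoder
# outputs and embeddings of candidate bias phrases, predict which phrases
# actually occur in the utterance and use that as an auxiliary "bias loss".
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhrasePredictionHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projects phrase embeddings
        self.key = nn.Linear(dim, dim)     # projects encoder frames

    def forward(self, enc_out, phrase_emb, phrase_present):
        """enc_out: (B, T, D); phrase_emb: (B, P, D);
        phrase_present: (B, P) 1.0 if the phrase occurs in the utterance."""
        q = self.query(phrase_emb)                      # (B, P, D)
        k = self.key(enc_out)                           # (B, T, D)
        scores = torch.einsum("bpd,btd->bpt", q, k)     # phrase-frame scores
        logits = scores.max(dim=-1).values              # (B, P) best frame match
        return F.binary_cross_entropy_with_logits(logits, phrase_present)


head = PhrasePredictionHead()
bias_loss = head(torch.randn(2, 40, 256), torch.randn(2, 8, 256),
                 torch.randint(0, 2, (2, 8)).float())
# total_loss = asr_loss + 0.3 * bias_loss   # weight is illustrative
```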
- Consistency Regularization for Adversarial Robustness [88.65786118562005]
Adversarial training (AT) is one of the most successful methods for obtaining adversarial robustness in deep neural networks.
However, a significant generalization gap in the robustness obtained from AT has been problematic.
In this paper, we investigate data augmentation techniques to address the issue.
arXiv Detail & Related papers (2021-03-08T09:21:41Z)
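As a loose illustration of consistency regularization in this setting, the sketch below penalizes divergence between the model's predictions on two augmented (e.g., adversarially perturbed) views of the same input. The summary above does not specify the divergence or weighting, so these choices are assumptions.

```python
# A minimal sketch of a consistency regulariser added on top of adversarial
# training: the model is encouraged to give similar predictive distributions
# for two augmented views of the same input.
import torch
import torch.nn.functional as F


def consistency_loss(logits_view1, logits_view2):
    """Symmetric KL between the predictions for two augmented views."""
    p1 = F.log_softmax(logits_view1, dim=-1)
    p2 = F.log_softmax(logits_view2, dim=-1)
    kl = F.kl_div(p1, p2, log_target=True, reduction="batchmean")
    kl_rev = F.kl_div(p2, p1, log_target=True, reduction="batchmean")
    return 0.5 * (kl + kl_rev)


logits_a = torch.randn(16, 10)   # predictions on augmented/adversarial view 1
logits_b = torch.randn(16, 10)   # predictions on augmented/adversarial view 2
reg = consistency_loss(logits_a, logits_b)
# total = adversarial_training_loss + lambda_consistency * reg
```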
- A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff.
cutoff relies on sampling consistency and thus adds little computational overhead.
cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
arXiv Detail & Related papers (2020-09-29T07:08:35Z)
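The sketch below shows a span-style cutoff augmentation of the kind the summary above alludes to: a contiguous block of token embeddings is zeroed out to create an augmented view, and several such views can then be tied together with a consistency term. The span length and the token-versus-feature choice are illustrative assumptions.

```python
# A hedged sketch of "cutoff"-style augmentation: zero out a random span of
# tokens in a sentence's embedding matrix to create an augmented view.
import torch


def span_cutoff(embeddings: torch.Tensor, ratio: float = 0.15) -> torch.Tensor:
    """embeddings: (B, T, D) token embeddings; zeroes a contiguous token span
    of roughly `ratio * T` tokens per example."""
    out = embeddings.clone()
    batch, seq_len, _ = out.shape
    span = max(1, int(seq_len * ratio))
    for b in range(batch):
        start = torch.randint(0, seq_len - span + 1, (1,)).item()
        out[b, start:start + span, :] = 0.0
    return out


emb = torch.randn(4, 32, 768)
view = span_cutoff(emb)          # one cutoff view; sample several views for
                                 # a consistency objective across them
```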
- PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation [58.98802062945709]
We propose a novel Predicate-Correlation Perception Learning scheme to adaptively seek out appropriate loss weights.
Our PCPL framework is further equipped with a graph encoder module to better extract context features.
arXiv Detail & Related papers (2020-09-02T08:30:09Z)
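As a very rough illustration of adaptively chosen loss weights, the sketch below re-weights a classification loss using running class frequencies. This frequency-based weighting is only a stand-in assumption; the paper's weights come from predicate-correlation scores rather than raw frequency.

```python
# A rough sketch of adaptively re-weighted classification loss: per-class
# weights are derived from a running estimate of class frequency (an
# illustrative stand-in for correlation-based scores).
import torch
import torch.nn.functional as F


class AdaptiveWeightedCE:
    def __init__(self, num_classes: int, momentum: float = 0.99):
        self.freq = torch.ones(num_classes) / num_classes
        self.momentum = momentum

    def __call__(self, logits: torch.Tensor, targets: torch.Tensor):
        # Update running class frequencies from the current batch.
        batch_freq = torch.bincount(targets, minlength=self.freq.numel()).float()
        batch_freq = batch_freq / batch_freq.sum().clamp_min(1.0)
        self.freq = self.momentum * self.freq + (1 - self.momentum) * batch_freq
        # Rarer classes receive larger weights.
        weights = 1.0 / (self.freq + 1e-6)
        weights = weights / weights.sum() * len(weights)
        return F.cross_entropy(logits, targets, weight=weights)


criterion = AdaptiveWeightedCE(num_classes=50)
loss = criterion(torch.randn(8, 50), torch.randint(0, 50, (8,)))
```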
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.