Fast and Robust Unsupervised Contextual Biasing for Speech Recognition
- URL: http://arxiv.org/abs/2005.01677v1
- Date: Mon, 4 May 2020 17:29:59 GMT
- Title: Fast and Robust Unsupervised Contextual Biasing for Speech Recognition
- Authors: Young Mo Kang, Yingbo Zhou
- Abstract summary: We propose an alternative approach that does not require an explicit contextual language model.
We derive the bias score for every word in the system vocabulary from the training corpus.
We show significant improvement in recognition accuracy when the relevant context is available.
- Score: 16.557586847398778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition (ASR) systems are becoming
ubiquitous. Although their accuracy is approaching human level under certain
settings, one area for further improvement is incorporating user-specific
information or context to bias their predictions. A common framework is to
dynamically construct a small language model from the provided contextual
mini corpus and interpolate its score with that of the main language model
during decoding.
Here we propose an alternative approach that does not require an explicit
contextual language model. Instead, we derive the bias score for every word in
the system vocabulary from the training corpus. The method is unique in three
respects: 1) It does not require metadata or class-label annotation for the
context or the training corpus. 2) The bias score is proportional to the word's
log-probability, so it not only biases recognition toward the provided context
but is also robust against irrelevant context (e.g., when the user mis-specifies
the context or when a tight scope is hard to define). 3) The bias score for the
entire vocabulary is predetermined during the training stage, which eliminates
computationally expensive language model construction during inference.
We show significant improvement in recognition accuracy when the relevant
context is available. We also demonstrate that the proposed method is highly
tolerant of false-triggering errors in the presence of irrelevant context.
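
The idea in the abstract of precomputing one bias score per vocabulary word, rather than building a contextual language model at inference time, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names (build_bias_table, context_bonus, rescore), the scale parameter, the use of unigram counts, and the specific form score = -scale * log p(w), which merely makes the score proportional to the word's log-probability so that rare words receive larger boosts than common ones.

```python
import math
from collections import Counter


def build_bias_table(training_sentences, scale=0.1):
    """Precompute one bias score per vocabulary word from unigram counts.

    Assumed form (not from the paper): score = -scale * log p(w).
    """
    counts = Counter(w for s in training_sentences for w in s.lower().split())
    total = sum(counts.values())
    return {w: -scale * math.log(c / total) for w, c in counts.items()}


def context_bonus(hypothesis, context_words, bias_table):
    """Sum the precomputed scores of hypothesis words found in the context."""
    return sum(
        bias_table.get(w, 0.0)
        for w in hypothesis.lower().split()
        if w in context_words
    )


def rescore(hypotheses, context_words, bias_table):
    """Return the (text, base_score) pair with the best biased score."""
    ctx = {w.lower() for w in context_words}
    return max(
        hypotheses,
        key=lambda h: h[1] + context_bonus(h[0], ctx, bias_table),
    )


# Hypothetical toy data: a tiny training corpus and two competing hypotheses
# with made-up base log-scores.
corpus = ["the cat sat on the mat", "the dog sat", "call anna"]
bias_table = build_bias_table(corpus)
hypotheses = [("call hannah", -1.0), ("call anna", -1.1)]

best_with_context = rescore(hypotheses, ["Anna"], bias_table)
best_with_irrelevant_context = rescore(hypotheses, ["the"], bias_table)
```

In this toy setup the table is built once from the training text, so inference only needs dictionary lookups. A relevant context word ("Anna") lifts the matching hypothesis, while an irrelevant, frequent context word ("the") leaves the ranking unchanged because it carries a small score and matches nothing in the hypotheses.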
Related papers
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative to no biasing and shallow fusion, respectively.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network [14.115294331065318]
We introduce a contextual phrase prediction network for an attention-based deep bias method.
This network predicts context phrases in utterances using contextual embeddings and calculates bias loss to assist in the training of the contextualized model.
Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models.
arXiv Detail & Related papers (2023-05-21T16:08:04Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
- Contextual information integration for stance detection via cross-attention [59.662413798388485]
Stance detection deals with identifying an author's stance towards a target.
Most existing stance detection models are limited because they do not consider relevant contextual information.
We propose an approach to integrate contextual information as text.
arXiv Detail & Related papers (2022-11-03T15:04:29Z)
- Sentence Representation Learning with Generative Objective rather than Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative learning achieves a substantial performance improvement and outperforms the current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z)
- Improving End-to-End Contextual Speech Recognition with Fine-grained Contextual Knowledge Selection [21.116123328330467]
This work focuses on mitigating confusion problems with fine-grained contextual knowledge selection (FineCoS).
We first apply phrase selection to narrow the range of phrase candidates, and then conduct token attention on the tokens in the selected phrase candidates.
We re-normalize the attention weights of most relevant phrases in inference to obtain more focused phrase-level contextual representations.
arXiv Detail & Related papers (2022-01-30T13:08:16Z)
- Narrative Incoherence Detection [76.43894977558811]
We propose the task of narrative incoherence detection as a new arena for inter-sentential semantic understanding.
Given a multi-sentence narrative, the task is to decide whether there are any semantic discrepancies in the narrative flow.
arXiv Detail & Related papers (2020-12-21T07:18:08Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- How Context Affects Language Models' Factual Predictions [134.29166998377187]
We integrate information from a retrieval system with a pre-trained language model in a purely unsupervised way.
We report that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline.
arXiv Detail & Related papers (2020-05-10T09:28:12Z)