Minimising Biasing Word Errors for Contextual ASR with the
Tree-Constrained Pointer Generator
- URL: http://arxiv.org/abs/2205.09058v1
- Date: Wed, 18 May 2022 16:40:50 GMT
- Title: Minimising Biasing Word Errors for Contextual ASR with the
Tree-Constrained Pointer Generator
- Authors: Guangzhi Sun, Chao Zhang, Philip C Woodland
- Abstract summary: Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words.
This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words.
- Score: 19.372248692745167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual knowledge is essential for reducing speech recognition errors on
high-valued long-tail words. This paper proposes a novel tree-constrained
pointer generator (TCPGen) component that enables end-to-end ASR models to bias
towards a list of long-tail words obtained using external contextual
information. With only a small overhead in memory use and computation cost,
TCPGen efficiently structures thousands of biasing words into a symbolic
prefix tree and creates a neural shortcut between the tree and the final ASR
output to facilitate recognition of the biasing words. To enhance TCPGen,
we further propose a novel minimum biasing word error (MBWE) loss that directly
optimises biasing word errors during training, along with a biasing-word-driven
language model discounting (BLMD) method applied at test time. All contextual ASR
systems were evaluated on the public LibriSpeech audiobook corpus and on data
from the dialogue state tracking challenges (DSTC), with biasing lists
extracted from the dialogue-system ontology. Consistent word error rate (WER)
reductions were achieved with TCPGen and were particularly significant on
the biasing words, with around 40% relative reductions in recognition error
rate. MBWE and BLMD further improved the effectiveness of TCPGen, achieving
even larger WER reductions on the biasing words. TCPGen also achieved
zero-shot learning of words absent from the audio training set, with large WER
reductions on the out-of-vocabulary words in the biasing list.
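To make the abstract's mechanism concrete, the sketch below shows, in minimal Python, how a biasing list can be compiled into a symbolic prefix tree, how the tree constrains which tokens the pointer can select after a partially decoded prefix, and how a pointer-generator distribution can be interpolated with the ASR model's output distribution. All names and the character-level tokenisation are illustrative assumptions; the authors' TCPGen operates on subword units inside attention-based encoder-decoder and transducer models.

```python
def build_prefix_tree(biasing_words):
    """Compile the biasing list into a prefix tree (trie).

    Each node maps a token to a child node; the "*" key marks the end
    of a complete biasing word. Illustrative only, not the authors'
    implementation.
    """
    root = {}
    for word in biasing_words:
        node = root
        for token in word:              # character-level for simplicity
            node = node.setdefault(token, {})
        node["*"] = True                # end-of-word marker
    return root


def valid_next_tokens(tree, prefix):
    """Tokens the tree allows after a partially decoded prefix.

    Restricting the pointer distribution to this set at each decoding
    step keeps the cost small even for thousands of biasing words.
    """
    node = tree
    for token in prefix:
        if token not in node:
            return set()                # prefix left the tree: no bias
        node = node[token]
    return {t for t in node if t != "*"}


def interpolate(p_model, p_ptr, p_gen):
    """Pointer-generator mixing of the two output distributions:
    P(y) = p_gen * P_ptr(y) + (1 - p_gen) * P_model(y),
    where p_gen is a learned generation probability.
    """
    vocab = set(p_model) | set(p_ptr)
    return {y: p_gen * p_ptr.get(y, 0.0)
               + (1.0 - p_gen) * p_model.get(y, 0.0)
            for y in vocab}


# Example: bias decoding towards two rare names.
tree = build_prefix_tree(["turin", "turing"])
print(valid_next_tokens(tree, list("turi")))    # {'n'}
```

Roughly, the MBWE loss and BLMD method described in the abstract then act on top of this shortcut: MBWE optimises biasing word errors directly during training, while BLMD discounts language model scores in a biasing-word-aware way at test time.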
Related papers
- Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss [44.94458898538114]
Using an explicit biasing loss as an auxiliary task in the intermediate encoder layers may better align text tokens or audio frames with the desired objectives.
Our proposed intermediate biasing loss brings more regularization and contextualization to the network.
arXiv Detail & Related papers (2024-06-23T14:22:59Z)
- Text Injection for Neural Contextual Biasing [57.589903308622745]
This work proposes contextual text injection (CTI) to enhance contextual ASR.
CTI with 100 billion text sentences can achieve up to a 43.3% relative WER reduction over a strong neural biasing model.
arXiv Detail & Related papers (2024-06-05T04:20:17Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show a 21.4% word error rate improvement over a baseline ASR system.
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
- Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network [14.115294331065318]
We introduce a contextual phrase prediction network for an attention-based deep biasing method.
This network predicts context phrases in utterances using contextual embeddings and calculates bias loss to assist in the training of the contextualized model.
Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models.
arXiv Detail & Related papers (2023-05-21T16:08:04Z)
- Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition [19.372248692745167]
This paper proposes the use of graph neural network (GNN) encodings in a tree-constrained pointer generator (TCPGen) component for end-to-end contextual ASR.
TCPGen with GNN encodings achieved about a further 15% relative WER reduction on the biasing words compared to the original TCPGen.
arXiv Detail & Related papers (2022-07-02T15:12:18Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition [16.160767678589895]
TCPGen is proposed, which incorporates contextual knowledge, in the form of a list of biasing words, into both attention-based encoder-decoder and transducer end-to-end ASR models.
TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut to facilitate recognising biasing words during decoding.
arXiv Detail & Related papers (2021-09-01T21:41:59Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)