Tree-constrained Pointer Generator for End-to-end Contextual Speech
Recognition
- URL: http://arxiv.org/abs/2109.00627v2
- Date: Fri, 3 Sep 2021 09:38:53 GMT
- Title: Tree-constrained Pointer Generator for End-to-end Contextual Speech
Recognition
- Authors: Guangzhi Sun, Chao Zhang, Philip C. Woodland
- Abstract summary: TCPGen is proposed that incorporates such knowledge as a list of biasing words into both attention-based encoder-decoder and transducer end-to-end ASR models.
TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut to facilitate recognising biasing words during decoding.
- Score: 16.160767678589895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual knowledge is important for real-world automatic speech recognition
(ASR) applications. In this paper, a novel tree-constrained pointer generator
(TCPGen) component is proposed that incorporates such knowledge as a list of
biasing words into both attention-based encoder-decoder and transducer
end-to-end ASR models in a neural-symbolic way. TCPGen structures the biasing
words into an efficient prefix tree to serve as its symbolic input and creates
a neural shortcut between the tree and the final ASR output distribution to
facilitate recognising biasing words during decoding. Systems were trained and
evaluated on the Librispeech corpus where biasing words were extracted at the
scales of an utterance, a chapter, or a book to simulate different application
scenarios. Experimental results showed that TCPGen consistently improved word
error rates (WERs) compared to the baselines, and in particular, achieved
significant WER reductions on the biasing words. TCPGen is highly efficient: it
can handle 5,000 biasing words and distractors while adding only a small overhead to
memory use and computation cost.
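The symbolic side of TCPGen, the prefix tree over the biasing list, can be illustrated with a minimal sketch. This is a hypothetical reconstruction for illustration, not the authors' implementation; the class names and the toy tokenizer are assumptions:

```python
# Minimal, hypothetical sketch (not the authors' code) of TCPGen's symbolic
# input: a prefix tree built over the wordpiece sequences of the biasing
# list. At each decoding step, the wordpieces branching from the current
# tree node are the only tokens the neural pointer shortcut may copy.

class PrefixTreeNode:
    def __init__(self):
        self.children = {}        # wordpiece -> PrefixTreeNode
        self.is_word_end = False  # True if a biasing word ends here

def build_prefix_tree(biasing_words, tokenize):
    """Insert every biasing word, split into wordpieces by `tokenize`."""
    root = PrefixTreeNode()
    for word in biasing_words:
        node = root
        for piece in tokenize(word):
            node = node.children.setdefault(piece, PrefixTreeNode())
        node.is_word_end = True
    return root

def valid_next_pieces(node):
    """The symbolic constraint at this step: wordpieces reachable in one hop."""
    return set(node.children)

# Toy tokenizer for illustration only; a real system would use its own
# wordpiece model (e.g. the ASR model's subword vocabulary).
toy_tokenize = lambda w: [w[:2]] + list(w[2:])

root = build_prefix_tree(["turin", "turner"], toy_tokenize)
print(sorted(valid_next_pieces(root)))                 # ['tu']
node = root.children["tu"].children["r"]
print(sorted(valid_next_pieces(node)))                 # ['i', 'n']
```

Restricting the pointer distribution to `valid_next_pieces` at each step is what keeps the lookup efficient even with thousands of biasing words: only one tree path per hypothesis needs to be tracked during decoding.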
Related papers
- Phoneme-aware Encoding for Prefix-tree-based Contextual ASR [45.161909551392085]
Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose.
We propose extending it with phoneme-aware encoding to better recognize words of unusual pronunciations.
arXiv Detail & Related papers (2023-12-15T07:37:09Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Graph Neural Networks for Contextual ASR with the Tree-Constrained Pointer Generator [9.053645441056256]
This paper proposes an innovative method for achieving end-to-end contextual ASR using graph neural network (GNN) encodings.
GNN encodings facilitate lookahead for future word pieces in the process of ASR decoding at each tree node.
The performance of the systems was evaluated on the Librispeech and AMI corpora, following the visually grounded contextual ASR pipeline.
arXiv Detail & Related papers (2023-05-30T08:20:58Z)
- LongFNT: Long-form Speech Recognition with Factorized Neural Transducer [64.75547712366784]
We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor.
The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reductions, respectively.
arXiv Detail & Related papers (2022-11-17T08:48:27Z)
- Neuro-Symbolic Causal Reasoning Meets Signaling Game for Emergent Semantic Communications [71.63189900803623]
A novel emergent SC system framework is proposed and is composed of a signaling game for emergent language design and a neuro-symbolic (NeSy) artificial intelligence (AI) approach for causal reasoning.
The ESC system is designed to enhance novel metrics of semantic information, reliability, distortion, and similarity.
arXiv Detail & Related papers (2022-10-21T15:33:37Z)
- Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition [19.372248692745167]
This paper proposes the use of graph neural network (GNN) encodings in a tree-constrained pointer generator (TCPGen) component for end-to-end contextual ASR.
TCPGen with GNN encodings achieved about a further 15% relative WER reduction on the biasing words compared to the original TCPGen.
arXiv Detail & Related papers (2022-07-02T15:12:18Z)
- Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator [19.372248692745167]
Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words.
This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words.
arXiv Detail & Related papers (2022-05-18T16:40:50Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers [13.372686722688325]
Training of end-to-end speech recognizers always requires transcribed utterances.
This paper proposes a method for alleviating this issue by transferring knowledge from a language model neural network that can be pretrained with text-only data.
arXiv Detail & Related papers (2022-02-16T07:02:24Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.