Self-consistent context aware conformer transducer for speech recognition
- URL: http://arxiv.org/abs/2402.06592v2
- Date: Thu, 03 Oct 2024 22:05:05 GMT
- Title: Self-consistent context aware conformer transducer for speech recognition
- Authors: Konstantin Kolokolov, Pavel Pekichev, Karthik Raghunathan
- Abstract summary: We introduce a novel neural network module that adeptly handles recursive data flow in neural network architectures.
Our method notably improves the accuracy of recognizing rare words without adversely affecting the word error rate for common vocabulary.
Our findings reveal that the combination of both approaches can improve the accuracy of detecting rare words by as much as 4.5 times.
- Score: 0.06008132390640294
- Abstract: We introduce a novel neural network module that adeptly handles recursive data flow in neural network architectures. At its core, this module employs a self-consistent approach where a set of recursive equations is solved iteratively, halting when the difference between two consecutive iterations falls below a defined threshold. Leveraging this mechanism, we construct a new neural network architecture, an extension of the conformer transducer, which enriches automatic speech recognition systems with a stream of contextual information. Our method notably improves the accuracy of recognizing rare words without adversely affecting the word error rate for common vocabulary. We investigate the improvement in accuracy for these uncommon words using our novel model, both independently and in conjunction with shallow fusion with a context language model. Our findings reveal that the combination of both approaches can improve the accuracy of detecting rare words by as much as 4.5 times. Our proposed self-consistent recursive methodology is versatile and adaptable, compatible with many recently developed encoders, and has the potential to drive model improvements in speech recognition and beyond.
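The halting rule described in the abstract (iterate a set of recursive equations until two consecutive iterates differ by less than a threshold) can be illustrated with a generic fixed-point loop. This is only a minimal sketch, not the authors' module: the names `solve_self_consistent` and `toy_update`, the tolerance, the iteration cap, and the toy update equation are all assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def solve_self_consistent(
    update: Callable[[Vector], Vector],
    h0: Vector,
    tol: float = 1e-6,
    max_iters: int = 100,
) -> Tuple[Vector, int, bool]:
    """Iterate h <- update(h) until consecutive iterates differ by less than tol.

    Returns the final state, the number of iterations run, and a flag
    indicating whether the threshold was reached before max_iters.
    """
    h = list(h0)
    for step in range(1, max_iters + 1):
        h_next = update(h)
        # Largest element-wise change between two consecutive iterations.
        diff = max(abs(a - b) for a, b in zip(h_next, h))
        h = h_next
        if diff < tol:
            return h, step, True
    return h, max_iters, False


def toy_update(h: Vector) -> Vector:
    """Hypothetical recursive equation: each unit depends on the mean state.

    The 0.5 coupling makes this map a contraction, so the loop converges.
    """
    x = [0.1, -0.2, 0.3]  # fixed "input" features (made up for the demo)
    mean_h = sum(h) / len(h)
    return [math.tanh(0.5 * mean_h + xi) for xi in x]


if __name__ == "__main__":
    state, steps, converged = solve_self_consistent(toy_update, [0.0, 0.0, 0.0])
    print(f"converged={converged} after {steps} iterations: {state}")
```

The iteration cap is a safety net: a threshold-only loop may never stop if the update is not a contraction, so a practical implementation would likely bound the number of steps as well.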
Related papers
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - The neural dynamics of auditory word recognition and integration [21.582292050622456]
We present a computational model of word recognition which formalizes this perceptual process in Bayesian decision theory.
We fit this model to explain scalp EEG signals recorded as subjects passively listened to a fictional story.
The model reveals distinct neural processing of words depending on whether or not they can be quickly recognized.
arXiv Detail & Related papers (2023-05-22T18:06:32Z) - Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition [91.39701446828144]
We show that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method.
They have shown promising results on speech command recognition tasks.
In contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
arXiv Detail & Related papers (2022-12-01T12:36:26Z) - Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z) - Word-level confidence estimation for RNN transducers [7.12355127219356]
We present a lightweight neural confidence model tailored for Automatic Speech Recognition (ASR) systems with Recurrent Neural Network Transducers (RNN-T).
Compared to other existing approaches, our model utilizes: (a) the time information associated with recognized words, which reduces the computational complexity, and (b) a simple and elegant trick for mapping between sub-word and word sequences.
arXiv Detail & Related papers (2021-09-28T18:38:00Z) - Position-Invariant Truecasing with a Word-and-Character Hierarchical Recurrent Neural Network [10.425277173548212]
We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model.
We also address the problem of truecasing while ignoring token positions in the sentence.
arXiv Detail & Related papers (2021-08-26T17:54:35Z) - A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
arXiv Detail & Related papers (2020-12-03T19:24:42Z) - Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.