Voice trigger detection from LVCSR hypothesis lattices using
bidirectional lattice recurrent neural networks
- URL: http://arxiv.org/abs/2003.00304v1
- Date: Sat, 29 Feb 2020 17:02:41 GMT
- Title: Voice trigger detection from LVCSR hypothesis lattices using
bidirectional lattice recurrent neural networks
- Authors: Woojay Jeon, Leo Liu, Henry Mason
- Abstract summary: We propose a method to reduce false voice triggers of a speech-enabled personal assistant by post-processing the hypothesis lattice of a server-side continuous speech recognizer via a neural network.
We first discuss how an estimate of the posterior probability of the trigger phrase can be obtained from the hypothesis lattice using known techniques to perform detection, then investigate a statistical model that processes the lattice in a more explicitly data-driven, discriminative manner.
- Score: 5.844015313757266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method to reduce false voice triggers of a speech-enabled
personal assistant by post-processing the hypothesis lattice of a server-side
large-vocabulary continuous speech recognizer (LVCSR) via a neural network. We
first discuss how an estimate of the posterior probability of the trigger
phrase can be obtained from the hypothesis lattice using known techniques to
perform detection, then investigate a statistical model that processes the
lattice in a more explicitly data-driven, discriminative manner. We propose
using a Bidirectional Lattice Recurrent Neural Network (LatticeRNN) for the
task, and show that it can significantly improve detection accuracy over using
the 1-best result or the posterior.
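The lattice-posterior baseline described in the abstract can be illustrated with a small sketch. Assuming a word lattice given as arcs (src, dst, word, log-probability) over topologically ordered integer node ids, the posterior of a trigger word (single-word for simplicity; the actual trigger is a phrase) is one minus the relative mass of lattice paths that avoid it. All names and the lattice representation here are hypothetical, not the paper's implementation:

```python
import math
from collections import defaultdict

def logadd(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def forward_logmass(arcs, start, end, skip_word=None):
    """Log of the summed probability of all start->end paths,
    optionally excluding arcs labelled skip_word."""
    alpha = defaultdict(lambda: float("-inf"))
    alpha[start] = 0.0
    # assumes integer node ids are in topological order
    for src, dst, word, logp in sorted(arcs):
        if word == skip_word or alpha[src] == float("-inf"):
            continue
        alpha[dst] = logadd(alpha[dst], alpha[src] + logp)
    return alpha[end]

def trigger_posterior(arcs, start, end, trigger):
    """P(a weighted random lattice path contains the trigger word)."""
    log_total = forward_logmass(arcs, start, end)
    log_avoid = forward_logmass(arcs, start, end, skip_word=trigger)
    return 1.0 - math.exp(log_avoid - log_total)
```

For example, in a two-path lattice where "hey siri" carries 0.6 of the probability mass and "hay siri" the remaining 0.4, the posterior of "hey" comes out as 0.6, while a word on every path gets posterior 1.0.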
Related papers
- HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, the generative capability of LLMs can even correct tokens that are missing from the N-best list.

arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Surrogate Gradient Spiking Neural Networks as Encoders for Large
Vocabulary Continuous Speech Recognition [91.39701446828144]
We show that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method.
They have shown promising results on speech command recognition tasks.
In contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
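The core trick this summary refers to can be sketched in a few lines: the forward pass uses a hard, non-differentiable spike function, while the backward pass substitutes a smooth surrogate for its derivative (here a fast-sigmoid-style surrogate in the spirit of SuperSpike). The neuron model and constants below are illustrative, not taken from the paper:

```python
import math

def spike(x):
    """Hard threshold used in the forward pass (non-differentiable)."""
    return 1.0 if x >= 0.0 else 0.0

def surrogate_grad(x, beta=10.0):
    """Smooth stand-in for the spike function's derivative, used only
    in the backward pass (fast-sigmoid-style surrogate)."""
    return 1.0 / (beta * abs(x) + 1.0) ** 2

def lif_forward(inputs, tau=0.9, threshold=1.0):
    """Leaky integrate-and-fire neuron unrolled over time, recording the
    surrogate derivatives that backprop-through-time would use."""
    v, spikes, sgrads = 0.0, [], []
    for x in inputs:
        v = tau * v + x                      # leaky membrane integration
        s = spike(v - threshold)
        sgrads.append(surrogate_grad(v - threshold))
        spikes.append(s)
        v -= s * threshold                   # soft reset after a spike
    return spikes, sgrads
```

Because the surrogate derivative stays bounded and decays with distance from the threshold, gradients through long unrolls remain controlled, which is one intuition behind the robustness to exploding gradients noted above.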
arXiv Detail & Related papers (2022-12-01T12:36:26Z)
- VQ-T: RNN Transducers using Vector-Quantized Prediction Network States [52.48566999668521]
We propose to use vector-quantized long short-term memory units in the prediction network of RNN transducers.
By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks.
arXiv Detail & Related papers (2022-08-03T02:45:52Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Improving the fusion of acoustic and text representations in RNN-T [35.43599666228086]
We propose to use gating, bilinear pooling, and a combination of them in the joint network to produce more expressive representations.
We show that the joint use of the proposed methods can result in 4%--5% relative word error rate reductions with only a few million extra parameters.
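As a rough illustration of the gating idea (the paper's exact parameterisation, and its bilinear-pooling variant, may differ; the weights here are hypothetical), an element-wise gate can decide per dimension how much acoustic versus text information enters the joint representation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(acoustic, text, w_a, w_t, bias):
    """Per-dimension gate g = sigmoid(w_a*a + w_t*t + b) blending the
    acoustic (a) and text (t) representations in the joint network."""
    fused = []
    for a, t, wa, wt, b in zip(acoustic, text, w_a, w_t, bias):
        g = sigmoid(wa * a + wt * t + b)
        fused.append(g * a + (1.0 - g) * t)
    return fused
```

With all weights and biases at zero, the gate sits at 0.5 and the fusion reduces to a plain average of the two representations; training moves the gate toward whichever modality is more informative per dimension.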
arXiv Detail & Related papers (2022-01-25T11:20:50Z)
- Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV).
We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples.
Our code will be made open-source for future work to compare against.
arXiv Detail & Related papers (2021-07-01T08:58:16Z)
- StutterNet: Stuttering Detection Using Time Delay Neural Network [9.726119468893721]
This paper introduces StutterNet, a novel deep-learning-based stuttering detection system.
We use a time-delay neural network (TDNN) suitable for capturing contextual aspects of the disfluent utterances.
Our method achieves promising results and outperforms the state-of-the-art residual neural network based method.
arXiv Detail & Related papers (2021-05-12T11:36:01Z)
- Scalable Polyhedral Verification of Recurrent Neural Networks [9.781772283276734]
We present a scalable and precise verifier for recurrent neural networks, called Prover.
Our evaluation shows that Prover successfully verifies several challenging recurrent models in computer vision, speech, and motion sensor classification.
arXiv Detail & Related papers (2020-05-27T11:57:01Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Lattice-based Improvements for Voice Triggering Using Graph Neural
Networks [12.378732821814816]
Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant.
In this paper, we address the task of false trigger mitigation (FTM) using a novel approach based on analyzing automatic speech recognition (ASR) lattices with graph neural networks (GNNs).
Our experiments demonstrate that GNNs are highly accurate at the FTM task, mitigating 87% of false triggers at a 99% true positive rate (TPR).
arXiv Detail & Related papers (2020-01-25T01:34:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.