Voice trigger detection from LVCSR hypothesis lattices using
bidirectional lattice recurrent neural networks
- URL: http://arxiv.org/abs/2003.00304v1
- Date: Sat, 29 Feb 2020 17:02:41 GMT
- Title: Voice trigger detection from LVCSR hypothesis lattices using
bidirectional lattice recurrent neural networks
- Authors: Woojay Jeon, Leo Liu, Henry Mason
- Abstract summary: We propose a method to reduce false voice triggers of a speech-enabled personal assistant by post-processing the hypothesis lattice of a server-side continuous speech recognizer via a neural network.
We first discuss how an estimate of the posterior probability of the trigger phrase can be obtained from the hypothesis lattice using known techniques to perform detection, then investigate a statistical model that processes the lattice in a more explicitly data-driven, discriminative manner.
- Score: 5.844015313757266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method to reduce false voice triggers of a speech-enabled
personal assistant by post-processing the hypothesis lattice of a server-side
large-vocabulary continuous speech recognizer (LVCSR) via a neural network. We
first discuss how an estimate of the posterior probability of the trigger
phrase can be obtained from the hypothesis lattice using known techniques to
perform detection, then investigate a statistical model that processes the
lattice in a more explicitly data-driven, discriminative manner. We propose
using a Bidirectional Lattice Recurrent Neural Network (LatticeRNN) for the
task, and show that it can significantly improve detection accuracy over using
the 1-best result or the posterior.
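The lattice-posterior baseline described in the abstract can be illustrated with a small sketch. Assuming a word lattice given as arcs (src, dst, word, log-probability) over topologically ordered integer node ids, the posterior of a trigger word (single-word for simplicity; the actual trigger is a phrase) is one minus the relative mass of lattice paths that avoid it. All names and the lattice representation here are hypothetical, not the paper's implementation:

```python
import math
from collections import defaultdict

def logadd(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def forward_logmass(arcs, start, end, skip_word=None):
    """Log of the summed probability of all start->end paths,
    optionally excluding arcs labelled skip_word."""
    alpha = defaultdict(lambda: float("-inf"))
    alpha[start] = 0.0
    # assumes integer node ids are in topological order
    for src, dst, word, logp in sorted(arcs):
        if word == skip_word or alpha[src] == float("-inf"):
            continue
        alpha[dst] = logadd(alpha[dst], alpha[src] + logp)
    return alpha[end]

def trigger_posterior(arcs, start, end, trigger):
    """P(a weighted random lattice path contains the trigger word)."""
    log_total = forward_logmass(arcs, start, end)
    log_avoid = forward_logmass(arcs, start, end, skip_word=trigger)
    return 1.0 - math.exp(log_avoid - log_total)
```

For example, in a two-path lattice where "hey siri" carries 0.6 of the probability mass and "hay siri" the remaining 0.4, the posterior of "hey" comes out as 0.6, while a word on every path gets posterior 1.0.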
Related papers
- HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, the generative capability of LLMs can even correct tokens that are missing from the N-best list.

arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Surrogate Gradient Spiking Neural Networks as Encoders for Large
Vocabulary Continuous Speech Recognition [91.39701446828144]
We show that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method.
They have shown promising results on speech command recognition tasks.
In contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
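The core trick this summary refers to can be sketched in a few lines: the forward pass uses a hard, non-differentiable spike function, while the backward pass substitutes a smooth surrogate for its derivative (here a fast-sigmoid-style surrogate in the spirit of SuperSpike). The neuron model and constants below are illustrative, not taken from the paper:

```python
import math

def spike(x):
    """Hard threshold used in the forward pass (non-differentiable)."""
    return 1.0 if x >= 0.0 else 0.0

def surrogate_grad(x, beta=10.0):
    """Smooth stand-in for the spike function's derivative, used only
    in the backward pass (fast-sigmoid-style surrogate)."""
    return 1.0 / (beta * abs(x) + 1.0) ** 2

def lif_forward(inputs, tau=0.9, threshold=1.0):
    """Leaky integrate-and-fire neuron unrolled over time, recording the
    surrogate derivatives that backprop-through-time would use."""
    v, spikes, sgrads = 0.0, [], []
    for x in inputs:
        v = tau * v + x                      # leaky membrane integration
        s = spike(v - threshold)
        sgrads.append(surrogate_grad(v - threshold))
        spikes.append(s)
        v -= s * threshold                   # soft reset after a spike
    return spikes, sgrads
```

Because the surrogate derivative stays bounded and decays with distance from the threshold, gradients through long unrolls remain controlled, which is one intuition behind the robustness to exploding gradients noted above.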
arXiv Detail & Related papers (2022-12-01T12:36:26Z)
- VQ-T: RNN Transducers using Vector-Quantized Prediction Network States [52.48566999668521]
We propose to use vector-quantized long short-term memory units in the prediction network of RNN transducers.
By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks.
arXiv Detail & Related papers (2022-08-03T02:45:52Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Improving the fusion of acoustic and text representations in RNN-T [35.43599666228086]
We propose to use gating, bilinear pooling, and a combination of them in the joint network to produce more expressive representations.
We show that the joint use of the proposed methods can result in 4%--5% relative word error rate reductions with only a few million extra parameters.
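As a rough illustration of the gating idea (the paper's exact parameterisation, and its bilinear-pooling variant, may differ; the weights here are hypothetical), an element-wise gate can decide per dimension how much acoustic versus text information enters the joint representation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(acoustic, text, w_a, w_t, bias):
    """Per-dimension gate g = sigmoid(w_a*a + w_t*t + b) blending the
    acoustic (a) and text (t) representations in the joint network."""
    fused = []
    for a, t, wa, wt, b in zip(acoustic, text, w_a, w_t, bias):
        g = sigmoid(wa * a + wt * t + b)
        fused.append(g * a + (1.0 - g) * t)
    return fused
```

With all weights and biases at zero, the gate sits at 0.5 and the fusion reduces to a plain average of the two representations; training moves the gate toward whichever modality is more informative per dimension.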
arXiv Detail & Related papers (2022-01-25T11:20:50Z)
- Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV).
We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples.
Our code will be made open-source for future work to compare against.
arXiv Detail & Related papers (2021-07-01T08:58:16Z)
- StutterNet: Stuttering Detection Using Time Delay Neural Network [9.726119468893721]
This paper introduces StutterNet, a novel deep-learning-based stuttering detection system.
We use a time-delay neural network (TDNN) suitable for capturing contextual aspects of the disfluent utterances.
Our method achieves promising results and outperforms the state-of-the-art residual neural network based method.
arXiv Detail & Related papers (2021-05-12T11:36:01Z)
- Scalable Polyhedral Verification of Recurrent Neural Networks [9.781772283276734]
We present a scalable and precise verifier for recurrent neural networks, called Prover.
Our evaluation shows that Prover successfully verifies several challenging recurrent models in computer vision, speech, and motion sensor classification.
arXiv Detail & Related papers (2020-05-27T11:57:01Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Lattice-based Improvements for Voice Triggering Using Graph Neural
Networks [12.378732821814816]
Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant.
In this paper, we address the task of false trigger mitigation (FTM) using a novel approach based on analyzing automatic speech recognition (ASR) lattices with graph neural networks (GNNs).
Our experiments demonstrate that GNNs are highly accurate at the FTM task, mitigating 87% of false triggers at a 99% true positive rate (TPR).
arXiv Detail & Related papers (2020-01-25T01:34:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.