Lattice-based Improvements for Voice Triggering Using Graph Neural
Networks
- URL: http://arxiv.org/abs/2001.10822v1
- Date: Sat, 25 Jan 2020 01:34:15 GMT
- Title: Lattice-based Improvements for Voice Triggering Using Graph Neural
Networks
- Authors: Pranay Dighe, Saurabh Adya, Nuoyu Li, Srikanth Vishnubhotla, Devang
Naik, Adithya Sagar, Ying Ma, Stephen Pulman, Jason Williams
- Abstract summary: Mitigation of false triggers is an important aspect of building a privacy-centric, non-intrusive smart assistant.
In this paper, we address the task of false trigger mitigation (FTM) using a novel approach based on analyzing automatic speech recognition (ASR) lattices with graph neural networks (GNNs).
Our experiments demonstrate that GNNs are highly accurate at the FTM task, mitigating ~87% of false triggers at a 99% true positive rate (TPR).
- Score: 12.378732821814816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice-triggered smart assistants often rely on detection of a trigger-phrase
before they start listening for the user request. Mitigation of false triggers
is an important aspect of building a privacy-centric non-intrusive smart
assistant. In this paper, we address the task of false trigger mitigation (FTM)
using a novel approach based on analyzing automatic speech recognition (ASR)
lattices with graph neural networks (GNNs). The proposed approach uses the fact
that the decoding lattice of falsely triggered audio exhibits uncertainties, in
terms of many alternative paths and unexpected words on the lattice arcs, as
compared to the lattice of correctly triggered audio. A pure trigger-phrase
detector model does not fully utilize the intent of the user speech, whereas by
using the complete decoding lattice of the user audio, we can effectively
mitigate speech not intended for the smart assistant. We deploy two variants of
GNNs in this paper, based on (1) graph convolution layers and (2) a
self-attention mechanism, respectively. Our experiments demonstrate that GNNs
are highly accurate at the FTM task, mitigating ~87% of false triggers at a 99%
true positive rate (TPR). Furthermore, the proposed models are fast to train
and parameter-efficient.
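Below is a minimal, hedged sketch of the graph-convolution variant, assuming PyTorch; the class name `LatticeGCN`, the node-feature layout, and the mean pooling are illustrative stand-ins rather than the paper's actual implementation.

```python
# Minimal sketch (assumption: PyTorch; lattice representation and names are illustrative).
import torch
import torch.nn as nn

class LatticeGCN(nn.Module):
    """Toy graph-convolution classifier over an ASR lattice.

    Each lattice arc is treated as a node; node features could be a word
    embedding concatenated with acoustic/LM scores. Edges connect arcs that
    follow each other in the lattice.
    """
    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.gc1 = nn.Linear(in_dim, hidden_dim)
        self.gc2 = nn.Linear(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 1)  # true trigger vs. false trigger

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_nodes, in_dim) node features
        # adj: (num_nodes, num_nodes) adjacency with self-loops, row-normalized
        h = torch.relu(adj @ self.gc1(x))   # one round of neighborhood aggregation
        h = torch.relu(adj @ self.gc2(h))   # second graph-convolution layer
        g = h.mean(dim=0)                   # pool the whole lattice into one vector
        return self.classifier(g)           # logit; sigmoid > threshold => keep trigger

# Tiny usage example with a random 5-node, chain-like lattice.
x = torch.randn(5, 16)
adj = torch.eye(5) + torch.diag(torch.ones(4), 1)   # successor edges + self-loops
adj = adj / adj.sum(dim=1, keepdim=True)            # row-normalize
logit = LatticeGCN(in_dim=16)(x, adj)
print(torch.sigmoid(logit))
```

The self-attention variant described in the abstract would replace the fixed adjacency-weighted aggregation with learned attention weights over lattice neighbors.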
Related papers
- Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition [4.164975438207411]
In recent years, typical backdoor attacks on speech recognition systems have been researched.
The attacker adds subtle changes to benign speech spectrograms or alters speech components such as pitch and timbre.
To improve the stealthiness of data poisoning, we propose a non-neural and fast algorithm called Random Spectrogram Rhythm Transformation.
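As a rough illustration only, the sketch below applies a random piecewise time warp to a spectrogram, assuming NumPy; it is not the paper's Random Spectrogram Rhythm Transformation, just a hedged approximation of perturbing rhythm while keeping spectral content.

```python
# Rough illustration (assumption: simple piecewise frame resampling, not the paper's algorithm).
import numpy as np

def random_rhythm_warp(spec: np.ndarray, num_segments: int = 4,
                       max_stretch: float = 0.2, rng=None) -> np.ndarray:
    """spec: (freq_bins, frames). Split the time axis into segments and
    stretch/compress each one by a small random factor."""
    rng = rng or np.random.default_rng(0)
    segments = np.array_split(spec, num_segments, axis=1)
    warped = []
    for seg in segments:
        factor = 1.0 + rng.uniform(-max_stretch, max_stretch)
        new_len = max(1, int(round(seg.shape[1] * factor)))
        idx = np.linspace(0, seg.shape[1] - 1, new_len).round().astype(int)
        warped.append(seg[:, idx])          # nearest-frame resampling of the segment
    return np.concatenate(warped, axis=1)

print(random_rhythm_warp(np.random.randn(80, 200)).shape)
```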
arXiv Detail & Related papers (2024-06-16T13:29:21Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show a 21.4% word error rate improvement compared to a baseline ASR system.
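A hedged sketch of what n-gram-based candidate retrieval could look like, assuming Python; the character n-gram index and overlap scoring below are illustrative and not SpellMapper's actual algorithm.

```python
# Rough sketch (assumption: character n-gram index and overlap scoring are illustrative).
from collections import defaultdict

def char_ngrams(text: str, n: int = 3):
    text = f"#{text}#"                       # pad so word boundaries form n-grams
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(custom_phrases):
    """Map each character n-gram to the custom phrases that contain it."""
    index = defaultdict(set)
    for phrase in custom_phrases:
        for g in char_ngrams(phrase):
            index[g].add(phrase)
    return index

def retrieve_candidates(asr_fragment, index, top_k=3):
    """Score custom phrases by how many n-grams they share with the
    (possibly misspelled) ASR fragment and return the best matches."""
    scores = defaultdict(int)
    for g in char_ngrams(asr_fragment):
        for phrase in index.get(g, ()):
            scores[phrase] += 1
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

index = build_index(["kubernetes", "prometheus", "grafana"])
print(retrieve_candidates("cooper netties", index))   # noisy ASR of "kubernetes"
```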
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
- Improving Voice Trigger Detection with Metric Learning [15.531040328839639]
We propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy.
A personalized voice trigger score is then obtained as a similarity score between the embeddings of enrollment utterances and a test utterance.
Experimental results show that the proposed approach achieves a 38% relative reduction in the false rejection rate.
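A minimal sketch of the similarity-scoring step, assuming PyTorch and cosine similarity against an averaged enrollment embedding; the embedding network and the 0.5 threshold are illustrative assumptions.

```python
# Minimal sketch (assumption: cosine similarity against the mean enrollment embedding).
import torch
import torch.nn.functional as F

def personalized_trigger_score(enrollment_embs: torch.Tensor,
                               test_emb: torch.Tensor) -> torch.Tensor:
    """enrollment_embs: (num_utts, dim) embeddings of the target speaker's
    enrollment utterances; test_emb: (dim,) embedding of the test utterance."""
    profile = F.normalize(enrollment_embs.mean(dim=0), dim=0)  # speaker profile
    return torch.dot(profile, F.normalize(test_emb, dim=0))    # cosine similarity

enroll = torch.randn(4, 128)          # e.g., four enrollment utterances
test = torch.randn(128)
score = personalized_trigger_score(enroll, test)
print(bool(score > 0.5))              # accept the trigger if similarity is high enough
```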
arXiv Detail & Related papers (2022-04-05T18:59:27Z)
- Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models [13.456066434598155]
We address the problem of detecting speech directed to a device that does not contain a specific wake-word.
Specifically, we focus on audio coming from a touch-based invocation.
arXiv Detail & Related papers (2022-03-30T01:27:39Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV).
We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples.
Our code will be made open source so that future work can compare against it.
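A hedged sketch of the detection rule, assuming NumPy; `asv_score` and `vocoder_resynthesize` are placeholders for a real ASV system and neural vocoder, and the threshold is hypothetical.

```python
# Illustrative sketch (assumption: placeholder ASV scorer and vocoder; hypothetical threshold).
import numpy as np

def asv_score(enrolled: np.ndarray, test: np.ndarray) -> float:
    """Placeholder speaker-verification score (cosine similarity of the arrays)."""
    return float(np.dot(enrolled, test) /
                 (np.linalg.norm(enrolled) * np.linalg.norm(test) + 1e-9))

def vocoder_resynthesize(test: np.ndarray) -> np.ndarray:
    """Placeholder for analysis followed by neural-vocoder re-synthesis."""
    return test + 0.01 * np.random.randn(*test.shape)

def is_adversarial(enrolled, test, threshold=0.1) -> bool:
    # Genuine audio: the ASV score barely changes after re-synthesis.
    # Adversarial audio: the crafted perturbation does not survive
    # re-synthesis, so the score shifts noticeably.
    original = asv_score(enrolled, test)
    resynth = asv_score(enrolled, vocoder_resynthesize(test))
    return abs(original - resynth) > threshold

print(is_adversarial(np.random.randn(192), np.random.randn(192)))
```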
arXiv Detail & Related papers (2021-07-01T08:58:16Z)
- Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
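A small sketch of the clustering-based half only, assuming scikit-learn's AgglomerativeClustering over precomputed x-vector-like embeddings; the EEND component and the hybrid hand-off are not shown, and the distance threshold is illustrative.

```python
# Sketch of the clustering-based half only (assumption: scikit-learn agglomerative
# clustering over precomputed embeddings; the distance threshold is data-dependent).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_speaker_embeddings(embeddings: np.ndarray,
                               distance_threshold: float = 20.0) -> np.ndarray:
    """embeddings: (num_segments, dim), one embedding per short speech segment.
    Returns a speaker label per segment without knowing the speaker count."""
    clusterer = AgglomerativeClustering(n_clusters=None,
                                        distance_threshold=distance_threshold)
    return clusterer.fit_predict(embeddings)

segments = np.vstack([np.random.randn(10, 32) + 3,    # pretend speaker A
                      np.random.randn(10, 32) - 3])   # pretend speaker B
print(cluster_speaker_embeddings(segments))
```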
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
- Knowledge Transfer for Efficient On-device False Trigger Mitigation [17.53768388104929]
An undirected utterance is termed a "false trigger", and false trigger mitigation (FTM) is essential for designing a privacy-centric smart assistant.
We propose an LSTM-based FTM architecture which determines the user intent directly from acoustic features, without explicitly generating ASR transcripts.
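A minimal sketch of an acoustic-only FTM classifier of this kind, assuming PyTorch; the feature dimensionality, single LSTM layer, and pooling are illustrative rather than the paper's on-device architecture.

```python
# Minimal sketch (assumption: log-mel inputs, one LSTM layer; not the paper's architecture).
import torch
import torch.nn as nn

class AcousticFTM(nn.Module):
    """Classify device-directed vs. undirected speech straight from acoustic
    frames (e.g., log-mel filterbanks), with no ASR transcript in the loop."""
    def __init__(self, feat_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(frames)      # last hidden state summarizes the utterance
        return self.out(h_n[-1])             # logit: intended for the assistant or not

logit = AcousticFTM()(torch.randn(2, 300, 40))   # two 3-second utterances at 10 ms frames
print(torch.sigmoid(logit))
```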
arXiv Detail & Related papers (2020-10-20T20:01:44Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Voice trigger detection from LVCSR hypothesis lattices using bidirectional lattice recurrent neural networks [5.844015313757266]
We propose a method to reduce false voice triggers of a speech-enabled personal assistant by post-processing the hypothesis lattice of a server-side continuous speech recognizer via a neural network.
We first discuss how an estimate of the posterior probability of the trigger phrase can be obtained from the hypothesis lattice using known techniques to perform detection, then investigate a statistical model that processes the lattice in a more explicitly data-driven, discriminative manner.
arXiv Detail & Related papers (2020-02-29T17:02:41Z)
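A toy sketch of the baseline posterior estimate, assuming a tiny hand-built lattice with arc log-probabilities; real systems would run forward-backward over the lattice rather than enumerate paths, and the bidirectional lattice RNN itself is not shown.

```python
# Toy sketch (assumption: tiny dict-based lattice; path enumeration stands in for forward-backward).
import math

LATTICE = {  # {state: [(next_state, word, log_prob), ...]}
    0: [(1, "hey", math.log(0.9)), (1, "hay", math.log(0.1))],
    1: [(2, "siri", math.log(0.8)), (2, "series", math.log(0.2))],
    2: [(3, "play", math.log(1.0))],
    3: [],
}

def paths(state=0, words=(), logp=0.0):
    """Enumerate (word_sequence, log_probability) for every path in the lattice."""
    if not LATTICE[state]:
        yield words, logp
        return
    for nxt, word, lp in LATTICE[state]:
        yield from paths(nxt, words + (word,), logp + lp)

def trigger_posterior(trigger=("hey", "siri")) -> float:
    total, hit = 0.0, 0.0
    for words, logp in paths():
        p = math.exp(logp)
        total += p
        if words[:len(trigger)] == trigger:
            hit += p
    return hit / total   # probability mass of paths starting with the trigger phrase

print(trigger_posterior())   # ~0.72 for this toy lattice
```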