Knowledge Transfer for Efficient On-device False Trigger Mitigation
- URL: http://arxiv.org/abs/2010.10591v1
- Date: Tue, 20 Oct 2020 20:01:44 GMT
- Title: Knowledge Transfer for Efficient On-device False Trigger Mitigation
- Authors: Pranay Dighe, Erik Marchi, Srikanth Vishnubhotla, Sachin Kajarekar,
Devang Naik
- Abstract summary: An undirected utterance is termed a "false trigger," and false trigger mitigation (FTM) is essential for designing a privacy-centric smart assistant.
We propose an LSTM-based FTM architecture that determines user intent directly from acoustic features, without explicitly generating ASR transcripts.
- Score: 17.53768388104929
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the task of determining whether a given utterance
is directed towards a voice-enabled smart-assistant device or not. An
undirected utterance is termed a "false trigger," and false trigger
mitigation (FTM) is essential for designing a privacy-centric non-intrusive
smart assistant. The directedness of an utterance can be identified by running
automatic speech recognition (ASR) on it and determining the user intent by
analyzing the ASR transcript. But in case of a false trigger, transcribing the
audio using ASR itself is strongly undesirable. To alleviate this issue, we
propose an LSTM-based FTM architecture which determines the user intent from
acoustic features directly without explicitly generating ASR transcripts from
the audio. The proposed models have a small footprint and can run on-device
with limited computational resources. During training, the model parameters are
optimized using a knowledge transfer approach where a more accurate
self-attention graph neural network model serves as the teacher. Given the
whole audio snippets, our approach mitigates 87% of false triggers at 99% true
positive rate (TPR), and in a streaming audio scenario, the system listens to
only 1.69s of the false trigger audio before rejecting it while achieving the
same TPR.
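The following is a minimal sketch, in PyTorch, of the kind of small-footprint LSTM student and distillation objective the abstract describes. The names (FTMStudent, distillation_loss, streaming_reject), the feature dimensions, the loss weighting, and the decision threshold are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FTMStudent(nn.Module):
    """Small-footprint LSTM scoring directedness from acoustic features."""
    def __init__(self, feat_dim=40, hidden=64, num_layers=2):
        super().__init__()
        # Unidirectional LSTM so the model also works on streaming audio.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # classes: {false trigger, directed}

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) acoustic features, e.g. filterbanks
        out, _ = self.lstm(feats)
        return self.head(out)  # per-frame logits: (batch, frames, 2)

def distillation_loss(student_logits, teacher_probs, labels, alpha=0.5, T=2.0):
    """Blend hard-label cross-entropy with a KL term toward the posteriors of
    the more accurate (but heavier) self-attention GNN teacher."""
    last = student_logits[:, -1, :]  # final frame as the utterance-level score
    ce = F.cross_entropy(last, labels)
    kl = F.kl_div(F.log_softmax(last / T, dim=-1), teacher_probs,
                  reduction="batchmean") * T * T
    return alpha * ce + (1.0 - alpha) * kl

@torch.no_grad()
def streaming_reject(model, frame_stream, threshold=0.99):
    """Reject a false trigger as soon as its posterior clears the threshold,
    so the device stops listening early (cf. the 1.69 s figure above)."""
    state = None
    for t, frame in enumerate(frame_stream):  # frame: (1, 1, feat_dim)
        out, state = model.lstm(frame, state)
        p_false = F.softmax(model.head(out[:, -1, :]), dim=-1)[0, 0]
        if p_false > threshold:
            return t  # number of frames consumed before rejection
    return None  # never rejected: treat as device-directed
```

The unidirectional LSTM makes early rejection natural in the streaming scenario: the per-frame posterior can be checked as audio arrives, which is how a false trigger can be dismissed after only a short prefix of the utterance.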
Related papers
- Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition [18.50957174600796]
A common solution to automatic speech recognition (ASR) of overlapping speakers is to separate the speech and then perform ASR on the separated signals.
Currently, the separator produces artefacts which often degrade ASR performance.
This paper proposes a transcription-free method for joint training using only audio signals.
arXiv Detail & Related papers (2024-06-13T08:20:58Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- Improving Voice Trigger Detection with Metric Learning [15.531040328839639]
We propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy.
A personalized voice trigger score is then obtained as the similarity between the embeddings of the enrollment utterances and a test utterance (a minimal scoring sketch appears after this list).
Experimental results show that the proposed approach achieves a 38% relative reduction in the false rejection rate.
arXiv Detail & Related papers (2022-04-05T18:59:27Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what."
Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models [13.456066434598155]
We address the problem of detecting device-directed speech that does not contain a specific wake-word.
Specifically, we focus on audio coming from a touch-based invocation.
arXiv Detail & Related papers (2022-03-30T01:27:39Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation [9.691823786336716]
We present a unified and hardware-efficient architecture for the two-stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks.
Traditional FTM systems rely on automatic speech recognition lattices which are computationally expensive to obtain on device.
We propose a streaming transformer architecture, which progressively processes incoming audio chunks and maintains audio context to perform both VTD and FTM tasks.
arXiv Detail & Related papers (2021-05-14T00:41:42Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Lattice-based Improvements for Voice Triggering Using Graph Neural Networks [12.378732821814816]
Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant.
In this paper, we address the task of false trigger mitigation (FTM) using a novel approach that analyzes automatic speech recognition (ASR) lattices with graph neural networks (GNNs).
Our experiments demonstrate that GNNs are highly accurate at the FTM task, mitigating 87% of false triggers at a 99% true positive rate (TPR).
arXiv Detail & Related papers (2020-01-25T01:34:15Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end system for streaming ASR.
We apply time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism (a minimal masking sketch appears after this list).
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
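Two of the entries above describe their mechanisms concretely enough to sketch. First, for "Improving Voice Trigger Detection with Metric Learning", the personalized score is a similarity between enrollment and test embeddings. A minimal sketch follows, assuming a generic utterance encoder and cosine similarity; both are assumptions, since the paper may use a different encoder and metric.

```python
import torch
import torch.nn.functional as F

def personalized_trigger_score(encoder, enroll_feats, test_feats):
    """Score a test utterance against a speaker profile built from a small
    number of enrollment utterances (the encoder is a stand-in assumption)."""
    # Average the enrollment embeddings into a single speaker profile.
    profile = torch.stack([encoder(f) for f in enroll_feats]).mean(dim=0)
    # Cosine similarity between the profile and test embedding is the score.
    return F.cosine_similarity(profile, encoder(test_feats), dim=-1)
```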
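Second, for "Streaming automatic speech recognition with the transformer model", time-restricted self-attention limits each frame to a bounded context window. A minimal sketch of such a mask follows; the window sizes are illustrative assumptions.

```python
import torch

def time_restricted_mask(num_frames, left_ctx=20, right_ctx=4):
    """Boolean mask where True marks attention pairs that must be blocked,
    matching the convention of torch.nn.MultiheadAttention's attn_mask."""
    idx = torch.arange(num_frames)
    rel = idx[None, :] - idx[:, None]  # rel[i, j] = j - i (key minus query)
    allowed = (rel >= -left_ctx) & (rel <= right_ctx)
    return ~allowed  # pass as attn_mask to restrict the encoder's context
```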
This list is automatically generated from the titles and abstracts of the papers on this site.