Streaming on-device detection of device directed speech from voice and
touch-based invocation
- URL: http://arxiv.org/abs/2110.04656v1
- Date: Sat, 9 Oct 2021 22:33:42 GMT
- Title: Streaming on-device detection of device directed speech from voice and
touch-based invocation
- Authors: Ognjen Rudovic, Akanksha Bindal, Vineet Garg, Pramod Simha, Pranay
Dighe and Sachin Kajarekar
- Abstract summary: We propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection.
To facilitate the model deployment on-device, we introduce a new streaming decision layer, derived using the notion of temporal convolutional networks (TCN).
To the best of our knowledge, this is the first approach that can detect device-directed speech from more than one invocation type in a streaming fashion.
- Score: 12.42440115067583
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: When interacting with smart devices such as mobile phones or wearables, the
user typically invokes a virtual assistant (VA) by saying a keyword or by
pressing a button on the device. However, in many cases, the VA can
accidentally be invoked by keyword-like speech or an accidental button press,
which may have implications for user experience and privacy. To this end, we
propose an acoustic false-trigger-mitigation (FTM) approach for on-device
device-directed speech detection that simultaneously handles the voice-trigger
and touch-based invocation. To facilitate the model deployment on-device, we
introduce a new streaming decision layer, derived using the notion of temporal
convolutional networks (TCN) [1], known for their computational efficiency. To
the best of our knowledge, this is the first approach that can detect
device-directed speech from more than one invocation type in a streaming
fashion. We compare this approach with streaming alternatives based on a
vanilla average layer and canonical LSTMs, and show (i) that all models exhibit
only a small degradation in accuracy compared with the invocation-specific
models, and (ii) that the newly introduced streaming TCN consistently performs
better than, or comparably to, the alternatives, while mitigating
device-undirected speech earlier in time and achieving a 33% relative reduction
in runtime peak memory over its non-streaming counterpart, versus 7% for the
LSTM-based approach.
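As a rough illustration of the kind of streaming decision layer the abstract describes, the sketch below stacks causal dilated 1D convolutions and emits a device-directedness score at every frame, so undirected audio can be rejected before the utterance ends. The class names, layer sizes, and dilation schedule are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a streaming, TCN-style decision layer (assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Conv1d):
    """1D convolution that only sees past frames (left padding only)."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = (kernel_size - 1) * dilation

    def forward(self, x):                       # x: (batch, channels, time)
        return super().forward(F.pad(x, (self.left_pad, 0)))


class StreamingTCNDecision(nn.Module):
    """Stack of causal dilated convolutions with a per-frame sigmoid decision."""

    def __init__(self, feat_dim=256, hidden=64, kernel_size=3, n_layers=4):
        super().__init__()
        layers, in_ch = [], feat_dim
        for i in range(n_layers):
            layers += [CausalConv1d(in_ch, hidden, kernel_size, dilation=2 ** i),
                       nn.ReLU()]
            in_ch = hidden
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Conv1d(hidden, 1, kernel_size=1)   # frame-wise logit

    def forward(self, feats):                   # feats: (batch, time, feat_dim)
        h = self.tcn(feats.transpose(1, 2))
        return torch.sigmoid(self.head(h)).squeeze(1)     # (batch, time) scores


# Each frame gets a device-directedness score as the audio streams in.
scores = StreamingTCNDecision()(torch.randn(1, 100, 256))
```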
Related papers
- Multimodal Data and Resource Efficient Device-Directed Speech Detection
with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
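A hedged sketch of the kind of late fusion that summary describes: a pooled audio-encoder embedding, a pooled embedding of the ASR 1-best hypothesis, and a few scalar decoder signals are concatenated and classified. All dimensions and module names here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn


class DirectednessFusion(nn.Module):
    def __init__(self, audio_dim=512, text_dim=256, n_decoder_feats=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + text_dim + n_decoder_feats, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, audio_emb, text_emb, decoder_feats):
        fused = torch.cat([audio_emb, text_emb, decoder_feats], dim=-1)
        return torch.sigmoid(self.mlp(fused))   # P(device-directed)


p = DirectednessFusion()(torch.randn(1, 512), torch.randn(1, 256), torch.randn(1, 4))
```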
arXiv Detail & Related papers (2023-12-06T17:29:03Z)
- Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server side.
To protect privacy, audio features are sent to the cloud instead of raw audio.
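The sketch below illustrates that two-stage flow: a small on-device detector screens the stream, and only audio features (never raw audio) for promising segments are forwarded to a server-side verifier. The models, threshold, and helper names are placeholders, not the paper's implementation.

```python
import numpy as np

ON_DEVICE_THRESHOLD = 0.5  # assumed operating point

def on_device_score(features: np.ndarray) -> float:
    # Placeholder for the lightweight real-time model; returns a wake-word score.
    return float(features.mean())               # stand-in computation

def server_verify(features: np.ndarray) -> bool:
    # Placeholder for the larger server-side verification model.
    return bool(features.std() > 0.9)           # stand-in computation

def handle_window(features: np.ndarray) -> bool:
    """Accept a wake-word candidate only if both stages agree."""
    if on_device_score(features) < ON_DEVICE_THRESHOLD:
        return False                 # cheap rejection; nothing leaves the device
    return server_verify(features)   # features, not raw audio, go to the cloud

accepted = handle_window(np.random.randn(40, 80))  # e.g. 40 frames of 80-dim features
```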
arXiv Detail & Related papers (2023-10-17T16:22:18Z)
- LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion [18.83348872103488]
Grapheme-to-phoneme (G2P) plays the role of converting letters to their corresponding pronunciations.
Existing methods are either slow or poor in performance, and are limited in application scenarios.
We propose a novel method named LiteG2P which is fast, light and theoretically parallel.
arXiv Detail & Related papers (2023-03-02T09:16:21Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
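A rough sketch of iterative pruning interleaved with fine-tuning, in the spirit of the LTH-IF strategy mentioned above. It uses plain magnitude pruning and omits the lottery-ticket weight rewinding; the model, number of rounds, and 20% per-round amount are placeholders.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))

for _ in range(3):                               # a few prune / fine-tune rounds
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.2)
    # ... fine-tune the remaining weights here before the next round ...

for module in model.modules():                   # make the sparsity permanent
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```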
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
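A hedged sketch of the kind of 8-bit quantization mentioned above, using PyTorch dynamic quantization on a stand-in model; the actual VoiceFilter-Lite architecture and quantization tooling may differ.

```python
import torch
import torch.nn as nn

# Stand-in separation network (placeholder layers only).
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)  # weights stored as 8-bit integers; activations quantized on the fly

mask = quantized(torch.randn(1, 128))
```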
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
- End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its extension of synchronous/attention.
We use the labels as a cue for detecting speech segments with simple thresholding.
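A minimal sketch of that thresholding idea, under the assumption that the cue is the CTC blank posterior: frames where the blank probability is low (some non-blank label is likely) are marked as speech. The threshold value and function name are illustrative.

```python
import numpy as np

def ctc_vad(log_probs: np.ndarray, blank_id: int = 0, blank_thresh: float = 0.99):
    """log_probs: (time, vocab) CTC log-posteriors; returns a boolean speech mask."""
    blank_prob = np.exp(log_probs[:, blank_id])
    return blank_prob < blank_thresh        # True where a label (speech) is active

# Example with random per-frame posteriors over a 30-symbol vocabulary.
speech_mask = ctc_vad(np.log(np.random.dirichlet(np.ones(30), size=100)))
```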
arXiv Detail & Related papers (2020-02-03T03:36:34Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
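A sketch of a time-restricted self-attention mask: each frame may attend only to a bounded window of past and future frames, which caps look-ahead and enables streaming. The window sizes are illustrative assumptions, not the paper's settings.

```python
import torch

def time_restricted_mask(n_frames: int, left: int = 20, right: int = 4) -> torch.Tensor:
    """Boolean mask where True means frame i (query) may attend to frame j (key)."""
    idx = torch.arange(n_frames)
    rel = idx[None, :] - idx[:, None]           # offset j - i
    return (rel >= -left) & (rel <= right)

mask = time_restricted_mask(100)                # shape (100, 100)
# Invert (~mask) if passing to nn.MultiheadAttention, where True marks
# positions that are *not* allowed to attend.
```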
arXiv Detail & Related papers (2020-01-08T18:58:02Z)