To Wake-up or Not to Wake-up: Reducing Keyword False Alarm by Successive
Refinement
- URL: http://arxiv.org/abs/2304.03416v1
- Date: Thu, 6 Apr 2023 23:49:29 GMT
- Title: To Wake-up or Not to Wake-up: Reducing Keyword False Alarm by Successive
Refinement
- Authors: Yashas Malur Saidutta, Rakshith Sharma Srinivasa, Ching-Hua Lee,
Chouchang Yang, Yilin Shen, Hongxia Jin
- Abstract summary: We show that existing deep keyword spotting mechanisms can be improved by Successive Refinement.
Across multiple models ranging in size from 13K to 2.41M parameters, the successive refinement technique reduces FA by up to a factor of 8.
Our proposed approach is "plug-and-play" and can be applied to any deep keyword spotting model.
- Score: 58.96644066571205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyword spotting systems continuously process audio streams to detect
keywords. One of the most challenging tasks in designing such systems is to
reduce False Alarm (FA) which happens when the system falsely registers a
keyword despite the keyword not being uttered. In this paper, we propose a
simple yet elegant solution to this problem that follows from the law of total
probability. We show that existing deep keyword spotting mechanisms can be
improved by Successive Refinement, where the system first classifies whether
the input audio is speech or not, followed by whether the input is keyword-like
or not, and finally classifies which keyword was uttered. We show that, across
multiple models ranging in size from 13K to 2.41M parameters, the successive
refinement technique reduces FA by up to a factor of 8 on in-domain held-out FA
data, and by up to a factor of 7 on out-of-domain (OOD) FA data.
Further, our proposed approach is "plug-and-play" and can be applied to any
deep keyword spotting model.
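To make the chained decision concrete, the following is a minimal PyTorch sketch of one way a shared encoder with three successive heads could combine its outputs; the layer choices, dimensions, and head names are illustrative assumptions, not the authors' implementation.

    # Sketch only: chain P(speech | x) * P(keyword-like | speech, x) * P(keyword_k | keyword-like, x)
    # so the keyword posterior stays small whenever the coarser stages are uncertain,
    # which is the intuition for suppressing false alarms on non-speech and non-keyword audio.
    import torch
    import torch.nn as nn

    class SuccessiveRefinementKWS(nn.Module):
        def __init__(self, feat_dim=40, hidden=64, num_keywords=10):
            super().__init__()
            self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
            self.speech_head = nn.Linear(hidden, 1)              # speech vs. non-speech
            self.keyword_like_head = nn.Linear(hidden, 1)        # keyword-like vs. other speech
            self.keyword_head = nn.Linear(hidden, num_keywords)  # which keyword

        def forward(self, x):
            # x: (batch, time, feat_dim) acoustic features, e.g. log-mel frames
            _, h = self.encoder(x)
            h = h.squeeze(0)
            p_speech = torch.sigmoid(self.speech_head(h))          # P(speech | x)
            p_kw_like = torch.sigmoid(self.keyword_like_head(h))   # P(keyword-like | speech, x)
            p_which = torch.softmax(self.keyword_head(h), dim=-1)  # P(keyword_k | keyword-like, x)
            return p_speech * p_kw_like * p_which                  # (batch, num_keywords)

Because each stage gates the next, a detection threshold applied to the final product behaves like a standard keyword score but is additionally discounted whenever the audio does not look like speech or like any keyword, matching the successive-refinement idea described in the abstract.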
Related papers
- A Glance is Enough: Extract Target Sentence By Looking at A keyword [26.77461726960814]
This paper investigates the possibility of extracting a target sentence from multi-talker speech using only a keyword as input.
In social security applications, the keyword might be "help", and the goal is to identify what the person who called for help is articulating while ignoring other speakers.
We propose using the Transformer architecture to embed both the keyword and the speech utterance and then rely on the cross-attention mechanism to select the correct content.
arXiv Detail & Related papers (2023-10-09T02:28:19Z)
- Open-vocabulary Keyword-spotting with Adaptive Instance Normalization [18.250276540068047]
We propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters.
We show significant improvements over recent keyword spotting and ASR baselines.
arXiv Detail & Related papers (2023-09-13T13:49:42Z)
- Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting [6.4423565043274795]
We tackle few-shot open-set keyword spotting with a new benchmark setting, named splitGSC.
We propose episode-known dummy prototypes based on metric learning to better detect open-set inputs, and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets).
We also verify our method on a standard benchmark, miniImageNet, and D-ProtoNets shows the state-of-the-art open-set detection rate in FSOSR.
arXiv Detail & Related papers (2022-06-28T01:56:24Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success and semantics-preservation rates while changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition [62.997667081978825]
We present an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
In this paper we demonstrate that through this mechanism our system is able to recognize more than 85% of newly added words that it previously failed to recognize.
arXiv Detail & Related papers (2021-07-05T21:08:34Z)
- Few-Shot Keyword Spotting With Prototypical Networks [3.6930948691311016]
Keyword spotting has been widely used in many voice interfaces such as Amazon's Alexa and Google Home.
We first formulate this problem as a few-shot keyword spotting and approach it using metric learning.
We then propose a solution to the prototypical few-shot keyword spotting problem using temporal and dilated convolutions.
arXiv Detail & Related papers (2020-07-25T20:17:56Z)
- Exclusive Hierarchical Decoding for Deep Keyphrase Generation [63.357895318562214]
Keyphrase generation (KG) aims to summarize the main ideas of a document into a set of keyphrases.
Previous work in this setting employs a sequential decoding process to generate keyphrases.
We propose an exclusive hierarchical decoding framework that includes a hierarchical decoding process and either a soft or a hard exclusion mechanism.
arXiv Detail & Related papers (2020-04-18T02:58:00Z)
- Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [54.49880724137688]
The problem of out of vocabulary words (OOV) is typical for any speech recognition system.
One popular approach to covering OOVs is to use subword units rather than words.
In this paper we explore different existing methods of this solution on both graph construction and search method levels.
arXiv Detail & Related papers (2020-03-19T21:24:45Z)
- Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks [3.8382752162527933]
In this paper, we focus on an open-vocabulary keyword spotting method, allowing the user to define their own keywords without having to retrain the whole model.
We describe the different design choices leading to a fast and small-footprint system, able to run on tiny devices, for any arbitrary set of user-defined keywords.
arXiv Detail & Related papers (2020-02-25T13:27:31Z)