Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks
- URL: http://arxiv.org/abs/2002.10851v1
- Date: Tue, 25 Feb 2020 13:27:31 GMT
- Title: Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks
- Authors: Théodore Bluche, Maël Primet, Thibault Gisselbrecht
- Abstract summary: In this paper, we focus on an open-vocabulary keyword spotting method, allowing the user to define their own keywords without having to retrain the whole model.
We describe the different design choices leading to a fast and small-footprint system, able to run on tiny devices, for any arbitrary set of user-defined keywords.
- Score: 3.8382752162527933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore a keyword-based spoken language understanding system, in which the
intent of the user can directly be derived from the detection of a sequence of
keywords in the query. In this paper, we focus on an open-vocabulary keyword
spotting method, allowing the user to define their own keywords without having
to retrain the whole model. We describe the different design choices leading to
a fast and small-footprint system, able to run on tiny devices, for any
arbitrary set of user-defined keywords, without training data specific to those
keywords. The model, based on a quantized long short-term memory (LSTM) neural
network, trained with connectionist temporal classification (CTC), weighs less
than 500KB. Our approach takes advantage of some properties of the predictions
of CTC-trained networks to calibrate the confidence scores and implement a fast
detection algorithm. The proposed system outperforms a standard keyword-filler
model approach.
Related papers
- Semantic Meta-Split Learning: A TinyML Scheme for Few-Shot Wireless Image Classification [50.28867343337997]
This work presents a TinyML-based semantic communication framework for few-shot wireless image classification.
We exploit split learning to limit the computations performed by end-users while preserving privacy.
Meta-learning overcomes data-availability concerns and speeds up training by utilizing similarly trained tasks.
arXiv Detail & Related papers (2024-09-03T05:56:55Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems [41.24728444810133]
This paper investigates few-shot learning methods for open-set KWS classification by combining a deep feature encoder with a prototype-based classifier.
With user-defined keywords from 10 classes of the Google Speech Command dataset, our study reports an accuracy of up to 76% in a 10-shot scenario.
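The prototype-based open-set classifier this summary describes can be sketched in a few lines: class prototypes are the means of the few-shot support embeddings, and a query is assigned to its nearest prototype or rejected as an unknown keyword. The function names and the fixed distance threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prototypes(support_emb: np.ndarray, support_lab: np.ndarray) -> dict:
    """Class prototype = mean of that class's few-shot support embeddings."""
    return {int(c): support_emb[support_lab == c].mean(axis=0)
            for c in np.unique(support_lab)}

def classify(query: np.ndarray, protos: dict, reject_dist: float) -> int:
    """Nearest-prototype decision with open-set rejection: a query farther
    than reject_dist from every prototype is labeled -1 ('unknown')."""
    dists = {c: float(np.linalg.norm(query - p)) for c, p in protos.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] <= reject_dist else -1
```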
arXiv Detail & Related papers (2023-06-03T17:10:33Z)
- To Wake-up or Not to Wake-up: Reducing Keyword False Alarm by Successive Refinement [58.96644066571205]
We show that existing deep keyword spotting mechanisms can be improved by Successive Refinement.
We show that, across multiple models ranging from 13K to 2.41M parameters, the successive refinement technique reduces false alarms (FA) by up to a factor of 8.
Our proposed approach is "plug-and-play" and can be applied to any deep keyword spotting model.
arXiv Detail & Related papers (2023-04-06T23:49:29Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the prompt texts beyond closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
- Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting [6.4423565043274795]
We tackle few-shot open-set keyword spotting with a new benchmark setting, named splitGSC.
We propose episode-known dummy prototypes based on metric learning to better detect the open set, and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets).
We also verify our method on a standard benchmark, miniImageNet, and D-ProtoNets shows the state-of-the-art open-set detection rate in FSOSR.
arXiv Detail & Related papers (2022-06-28T01:56:24Z)
- Teaching keyword spotters to spot new keywords with limited examples [6.251896411370577]
We present KeySEM, a speech embedding model pre-trained on the task of recognizing a large number of keywords.
KeySEM is well suited to on-device environments where post-deployment learning and ease of customization are often desirable.
arXiv Detail & Related papers (2021-06-04T12:43:36Z)
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
arXiv Detail & Related papers (2020-12-03T19:24:42Z)
- Few-Shot Keyword Spotting With Prototypical Networks [3.6930948691311016]
Keyword spotting has been widely used in many voice interfaces, such as Amazon's Alexa and Google Home.
We first formulate this problem as a few-shot keyword spotting and approach it using metric learning.
We then propose a solution to the prototypical few-shot keyword spotting problem using networks with temporal and dilated convolutions.
arXiv Detail & Related papers (2020-07-25T20:17:56Z)
- Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling [9.927306160740974]
We propose smoothed max pooling loss and its application to keyword spotting systems.
The proposed approach jointly trains an encoder (to detect keyword parts) and a decoder (to detect whole keyword) in a semi-supervised manner.
The proposed new loss function allows training a model to detect parts and whole of a keyword, without strictly depending on frame-level labeling from LVCSR.
arXiv Detail & Related papers (2020-01-25T01:19:19Z)
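The smoothed max pooling mentioned in the last entry can be illustrated with a soft maximum: instead of pooling per-frame keyword logits with a hard max (which backpropagates through a single frame), a temperature-controlled log-mean-exp also gives weight to frames near the peak. This is an illustrative stand-in for the paper's loss, with `beta` an assumed temperature parameter.

```python
import numpy as np

def smoothed_max_pool(frame_logits: np.ndarray, beta: float = 5.0) -> float:
    """Soft maximum over per-frame keyword logits via log-mean-exp.

    As beta -> infinity this approaches the hard max; at moderate beta,
    frames near the peak also contribute, which is the intuition behind
    smoothing the pooling step before computing the detection loss.
    """
    z = beta * frame_logits
    m = z.max()  # subtract the max for numerical stability
    return float((m + np.log(np.exp(z - m).mean())) / beta)
```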
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.