Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks
- URL: http://arxiv.org/abs/2002.10851v1
- Date: Tue, 25 Feb 2020 13:27:31 GMT
- Title: Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks
- Authors: Théodore Bluche, Maël Primet, Thibault Gisselbrecht
- Abstract summary: In this paper, we focus on an open-vocabulary keyword spotting method, allowing the user to define their own keywords without having to retrain the whole model.
We describe the different design choices leading to a fast and small-footprint system, able to run on tiny devices, for any arbitrary set of user-defined keywords.
- Score: 3.8382752162527933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore a keyword-based spoken language understanding system, in which the
intent of the user can directly be derived from the detection of a sequence of
keywords in the query. In this paper, we focus on an open-vocabulary keyword
spotting method, allowing the user to define their own keywords without having
to retrain the whole model. We describe the different design choices leading to
a fast and small-footprint system, able to run on tiny devices, for any
arbitrary set of user-defined keywords, without training data specific to those
keywords. The model, based on a quantized long short-term memory (LSTM) neural
network, trained with connectionist temporal classification (CTC), weighs less
than 500KB. Our approach takes advantage of some properties of the predictions
of CTC-trained networks to calibrate the confidence scores and implement a fast
detection algorithm. The proposed system outperforms a standard keyword-filler
model approach.
Related papers
- Semantic Meta-Split Learning: A TinyML Scheme for Few-Shot Wireless Image Classification [50.28867343337997]
This work presents a TinyML-based semantic communication framework for few-shot wireless image classification.
We exploit split learning to limit the computations performed by end-users while preserving privacy.
Meta-learning overcomes data-availability concerns and speeds up training by utilizing similarly trained tasks.
arXiv Detail & Related papers (2024-09-03T05:56:55Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems [41.24728444810133]
This paper investigates few-shot learning methods for open-set KWS classification by combining a deep feature encoder with a prototype-based classifier.
With user-defined keywords from 10 classes of the Google Speech Command dataset, our study reports an accuracy of up to 76% in a 10-shot scenario.
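The prototype-based open-set classifier this summary describes can be sketched in a few lines: class prototypes are the means of the few-shot support embeddings, and a query is assigned to its nearest prototype or rejected as an unknown keyword. The function names and the fixed distance threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prototypes(support_emb: np.ndarray, support_lab: np.ndarray) -> dict:
    """Class prototype = mean of that class's few-shot support embeddings."""
    return {int(c): support_emb[support_lab == c].mean(axis=0)
            for c in np.unique(support_lab)}

def classify(query: np.ndarray, protos: dict, reject_dist: float) -> int:
    """Nearest-prototype decision with open-set rejection: a query farther
    than reject_dist from every prototype is labeled -1 ('unknown')."""
    dists = {c: float(np.linalg.norm(query - p)) for c, p in protos.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] <= reject_dist else -1
```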
arXiv Detail & Related papers (2023-06-03T17:10:33Z)
- To Wake-up or Not to Wake-up: Reducing Keyword False Alarm by Successive Refinement [58.96644066571205]
We show that existing deep keyword spotting mechanisms can be improved by Successive Refinement.
We show that, across multiple models ranging from 13K to 2.41M parameters, the successive refinement technique reduces false alarms (FA) by up to a factor of 8.
Our proposed approach is "plug-and-play" and can be applied to any deep keyword spotting model.
arXiv Detail & Related papers (2023-04-06T23:49:29Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the prompt texts beyond closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
- Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting [6.4423565043274795]
We tackle few-shot open-set keyword spotting with a new benchmark setting, named splitGSC.
We propose episode-known dummy prototypes based on metric learning to better detect the open set, and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets).
We also verify our method on a standard benchmark, miniImageNet, and D-ProtoNets shows the state-of-the-art open-set detection rate in FSOSR.
arXiv Detail & Related papers (2022-06-28T01:56:24Z)
- Teaching keyword spotters to spot new keywords with limited examples [6.251896411370577]
We present KeySEM, a speech embedding model pre-trained on the task of recognizing a large number of keywords.
KeySEM is well suited to on-device environments where post-deployment learning and ease of customization are often desirable.
arXiv Detail & Related papers (2021-06-04T12:43:36Z)
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
arXiv Detail & Related papers (2020-12-03T19:24:42Z)
- Few-Shot Keyword Spotting With Prototypical Networks [3.6930948691311016]
Keyword spotting has been widely used in many voice interfaces, such as Amazon's Alexa and Google Home.
We first formulate this problem as a few-shot keyword spotting and approach it using metric learning.
We then propose a solution to the prototypical few-shot keyword spotting problem using networks with temporal and dilated convolutions.
arXiv Detail & Related papers (2020-07-25T20:17:56Z)
- Learning To Detect Keyword Parts And Whole By Smoothed Max Pooling [9.927306160740974]
We propose smoothed max pooling loss and its application to keyword spotting systems.
The proposed approach jointly trains an encoder (to detect keyword parts) and a decoder (to detect whole keyword) in a semi-supervised manner.
The proposed new loss function allows training a model to detect parts and whole of a keyword, without strictly depending on frame-level labeling from LVCSR.
arXiv Detail & Related papers (2020-01-25T01:19:19Z)
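The smoothed max pooling mentioned in the last entry can be illustrated with a soft maximum: instead of pooling per-frame keyword logits with a hard max (which backpropagates through a single frame), a temperature-controlled log-mean-exp also gives weight to frames near the peak. This is an illustrative stand-in for the paper's loss, with `beta` an assumed temperature parameter.

```python
import numpy as np

def smoothed_max_pool(frame_logits: np.ndarray, beta: float = 5.0) -> float:
    """Soft maximum over per-frame keyword logits via log-mean-exp.

    As beta -> infinity this approaches the hard max; at moderate beta,
    frames near the peak also contribute, which is the intuition behind
    smoothing the pooling step before computing the detection loss.
    """
    z = beta * frame_logits
    m = z.max()  # subtract the max for numerical stability
    return float((m + np.log(np.exp(z - m).mean())) / beta)
```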
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.