Keyword Spotting with Hyper-Matched Filters for Small Footprint Devices
- URL: http://arxiv.org/abs/2508.04857v1
- Date: Wed, 06 Aug 2025 20:04:08 GMT
- Title: Keyword Spotting with Hyper-Matched Filters for Small Footprint Devices
- Authors: Yael Segal-Feldman, Ann R. Bradlow, Matthew Goldrick, Joseph Keshet
- Abstract summary: This paper introduces an open-vocabulary keyword spotting model with state-of-the-art detection accuracy for small-footprint devices. The model is composed of a speech encoder, a target keyword encoder, and a detection network.
- Score: 12.269545636442546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary keyword spotting (KWS) refers to the task of detecting words or terms within speech recordings, regardless of whether they were included in the training data. This paper introduces an open-vocabulary keyword spotting model with state-of-the-art detection accuracy for small-footprint devices. The model is composed of a speech encoder, a target keyword encoder, and a detection network. The speech encoder is either a tiny Whisper or a tiny Conformer. The target keyword encoder is implemented as a hyper-network that takes the desired keyword as a character string and generates a unique set of weights for a convolutional layer, which can be considered as a keyword-specific matched filter. The detection network uses the matched-filter weights to perform a keyword-specific convolution, which guides the cross-attention mechanism of a Perceiver module in determining whether the target term appears in the recording. The results indicate that our system achieves state-of-the-art detection performance and generalizes effectively to out-of-domain conditions, including second-language (L2) speech. Notably, our smallest model, with just 4.2 million parameters, matches or outperforms models that are several times larger, demonstrating both efficiency and robustness.
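The pipeline in the abstract (a hyper-network that turns a keyword string into the weights of a keyword-specific convolution, i.e. a matched filter) can be sketched roughly as follows. This is an illustrative toy only: the character embedding, mean pooling, single linear hyper-layer, and all dimensions are assumptions, not the paper's actual architecture, and the real system feeds the convolution output to a Perceiver-style cross-attention detector rather than taking a peak response.

```python
import numpy as np

CHARS = "abcdefghijklmnopqrstuvwxyz"
EMB_DIM, FILT_LEN, FEAT_DIM = 16, 5, 8   # hypothetical dimensions

rng = np.random.default_rng(0)
char_emb = rng.standard_normal((len(CHARS), EMB_DIM))
# hyper-network: maps a pooled keyword embedding to conv-filter weights
hyper_w = 0.1 * rng.standard_normal((EMB_DIM, FILT_LEN * FEAT_DIM))

def matched_filter(keyword: str) -> np.ndarray:
    """Generate keyword-specific 1-D conv weights from a character string."""
    ids = [CHARS.index(c) for c in keyword.lower() if c in CHARS]
    pooled = char_emb[ids].mean(axis=0)        # mean-pool character embeddings
    return (pooled @ hyper_w).reshape(FILT_LEN, FEAT_DIM)

def detection_score(features: np.ndarray, keyword: str) -> float:
    """Slide the matched filter over encoder features (T x FEAT_DIM)
    and return the peak response as a crude stand-in for the detector."""
    w = matched_filter(keyword)
    T = len(features)
    return max(float((features[t:t + FILT_LEN] * w).sum())
               for t in range(T - FILT_LEN + 1))
```

Because the filter weights are a function of the keyword string alone, new keywords can be enrolled without retraining the speech encoder, which is what makes the approach open-vocabulary.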
Related papers
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- SimCKP: Simple Contrastive Learning of Keyphrase Representations [36.88517357720033]
We propose SimCKP, a simple contrastive learning framework that consists of two stages: 1) An extractor-generator that extracts keyphrases by learning context-aware phrase-level representations in a contrastive manner while also generating keyphrases that do not appear in the document; and 2) A reranker that adapts scores for each generated phrase by likewise aligning their representations with the corresponding document.
arXiv Detail & Related papers (2023-10-12T11:11:54Z)
- To Wake-up or Not to Wake-up: Reducing Keyword False Alarm by Successive Refinement [58.96644066571205]
We show that existing deep keyword spotting mechanisms can be improved by Successive Refinement.
We show that, across multiple models ranging in size from 13K to 2.41M parameters, the successive refinement technique reduces false alarms (FA) by up to a factor of 8.
Our proposed approach is "plug-and-play" and can be applied to any deep keyword spotting model.
arXiv Detail & Related papers (2023-04-06T23:49:29Z)
- Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting [6.4423565043274795]
We tackle few-shot open-set keyword spotting with a new benchmark setting, named splitGSC.
We propose episode-known dummy prototypes based on metric learning to better detect an open set, and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets).
We also verify our method on a standard benchmark, miniImageNet, and D-ProtoNets shows the state-of-the-art open-set detection rate in FSOSR.
arXiv Detail & Related papers (2022-06-28T01:56:24Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change [58.87961226278285]
This paper describes SChME, a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change.
SChME uses a model ensemble combining signals from distributional models (word embeddings) and word-frequency models, where each model casts a vote indicating the probability that a word underwent semantic change according to that feature.
arXiv Detail & Related papers (2020-12-02T23:56:34Z)
- Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a better representation of features in both input and hidden layers.
This richer representation results in recognition accuracy improvement as shown by extensive experiments on Google speech commands (GSC) and synthetic speech commands (SSC) datasets.
arXiv Detail & Related papers (2020-11-23T14:40:18Z)
- Keyphrase Extraction with Dynamic Graph Convolutional Networks and Diversified Inference [50.768682650658384]
Keyphrase extraction (KE) aims to summarize a set of phrases that accurately express a concept or a topic covered in a given document.
The recent Sequence-to-Sequence (Seq2Seq) generative framework is widely used for the KE task and has obtained competitive performance on various benchmarks.
In this paper, we propose to adopt the Dynamic Graph Convolutional Networks (DGCN) to solve the above two problems simultaneously.
arXiv Detail & Related papers (2020-10-24T08:11:23Z)
- Few-Shot Keyword Spotting With Prototypical Networks [3.6930948691311016]
Keyword spotting has been widely used in many voice interfaces, such as Amazon's Alexa and Google Home.
We first formulate this problem as a few-shot keyword spotting and approach it using metric learning.
We then propose a solution to the prototypical few-shot keyword spotting problem using temporal and dilated convolutions in the network.
arXiv Detail & Related papers (2020-07-25T20:17:56Z)
- Acoustic Neighbor Embeddings [2.842794675894731]
This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings.
The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences.
The recognition accuracy is identical to that of conventional finite-state-transducer (FST)-based decoding on test data with up to 1 million names in the vocabulary and 40-dimensional embeddings.
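The distance property described in this entry (Euclidean distance in the embedding space tracks phonetic confusability) can be illustrated with a toy nearest-neighbor decoder. The names and coordinates below are invented for illustration; the paper learns embeddings from data for vocabularies of up to 1 million names.

```python
import numpy as np

# Toy embedding table: phonetically similar names sit close together.
vocab = {
    "anna":   np.array([0.0, 1.0]),
    "hannah": np.array([0.2, 0.9]),   # phonetically close to "anna"
    "robert": np.array([3.0, -2.0]),  # phonetically distant
}

def decode(query: np.ndarray) -> str:
    """Return the vocabulary item nearest to the query embedding."""
    return min(vocab, key=lambda w: float(np.linalg.norm(vocab[w] - query)))
```

With embeddings of this kind, recognition reduces to a nearest-neighbor search instead of a full FST decoding pass.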
arXiv Detail & Related papers (2020-07-20T05:33:07Z)
- Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks [3.8382752162527933]
In this paper, we focus on an open-vocabulary keyword spotting method, allowing the user to define their own keywords without having to retrain the whole model.
We describe the different design choices leading to a fast and small-footprint system, able to run on tiny devices, for any arbitrary set of user-defined keywords.
arXiv Detail & Related papers (2020-02-25T13:27:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.