Teaching keyword spotters to spot new keywords with limited examples
- URL: http://arxiv.org/abs/2106.02443v1
- Date: Fri, 4 Jun 2021 12:43:36 GMT
- Title: Teaching keyword spotters to spot new keywords with limited examples
- Authors: Abhijeet Awasthi, Kevin Kilgour, Hassan Rom
- Abstract summary: We present KeySEM, a speech embedding model pre-trained on the task of recognizing a large number of keywords.
KeySEM is well suited to on-device environments where post-deployment learning and ease of customization are often desirable.
- Score: 6.251896411370577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to recognize new keywords with just a few examples is essential for
personalizing keyword spotting (KWS) models to a user's choice of keywords.
However, modern KWS models are typically trained on large datasets and
restricted to a small vocabulary of keywords, limiting their transferability to
a broad range of unseen keywords. Towards easily customizable KWS models, we
present KeySEM (Keyword Speech EMbedding), a speech embedding model pre-trained
on the task of recognizing a large number of keywords. Speech representations
offered by KeySEM are highly effective for learning new keywords from a limited
number of examples. Comparisons with a diverse range of related work across
several datasets show that our method achieves consistently superior
performance with fewer training examples. Although KeySEM was pre-trained only
on English utterances, the performance gains also extend to datasets from four
other languages indicating that KeySEM learns useful representations well
aligned with the task of keyword spotting. Finally, we demonstrate KeySEM's
ability to learn new keywords sequentially without re-training on
previously learned keywords. Our experimental observations suggest that KeySEM
is well suited to on-device environments where post-deployment learning and
ease of customization are often desirable.
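The approach the abstract describes, learning new keywords from a few examples on top of a frozen pre-trained embedding, can be sketched as a nearest-centroid classifier. The `embed` function below is a hypothetical stand-in for a model like KeySEM (here just a fixed random projection so the sketch runs); the class names and API are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

# Hypothetical stand-in for a pre-trained speech embedding model such as
# KeySEM. A real model would map audio to an embedding; here a fixed
# random projection keeps the sketch self-contained and runnable.
_rng = np.random.default_rng(0)
_PROJECTION = _rng.normal(size=(40, 16))

def embed(features: np.ndarray) -> np.ndarray:
    """Map a 40-dim audio feature vector to a unit-norm 16-dim embedding."""
    v = features @ _PROJECTION
    return v / np.linalg.norm(v)

class FewShotKeywordSpotter:
    """Nearest-centroid classifier over frozen embeddings.

    A new keyword is learned by storing the mean embedding of its few
    examples; keywords learned earlier are never revisited, which mirrors
    the sequential-learning setting described in the abstract.
    """

    def __init__(self) -> None:
        self.centroids: dict[str, np.ndarray] = {}

    def add_keyword(self, name: str, examples: list) -> None:
        # Average the embeddings of the few provided examples.
        embs = np.stack([embed(x) for x in examples])
        self.centroids[name] = embs.mean(axis=0)

    def predict(self, features: np.ndarray) -> str:
        # Return the keyword whose centroid is most similar (dot product).
        e = embed(features)
        return max(self.centroids, key=lambda k: float(e @ self.centroids[k]))
```

Because adding a keyword only appends a centroid, post-deployment customization needs no gradient updates and no access to earlier training data, at the cost of a weaker decision rule than a trained classifier head.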
Related papers
- Open-vocabulary Keyword-spotting with Adaptive Instance Normalization [18.250276540068047]
We propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters.
We show significant improvements over recent keyword spotting and ASR baselines.
arXiv Detail & Related papers (2023-09-13T13:49:42Z)
- Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data [19.075820340282934]
We propose a framework that uses easily collectible, unlabeled reading speech data as an auxiliary source.
We then adopt multi-task learning that helps the model to enhance the representation power from out-of-domain auxiliary data.
arXiv Detail & Related papers (2023-08-31T07:29:42Z) - PatternRank: Leveraging Pretrained Language Models and Part of Speech
for Unsupervised Keyphrase Extraction [0.6767885381740952]
We present PatternRank, which leverages pretrained language models and part-of-speech information for unsupervised keyphrase extraction from single documents.
Our experiments show PatternRank achieves higher precision, recall and F1-scores than previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-10-11T08:23:54Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - On the Efficiency of Integrating Self-supervised Learning and
Meta-learning for User-defined Few-shot Keyword Spotting [51.41426141283203]
User-defined keyword spotting is a task to detect new spoken terms defined by users.
Previous works try to incorporate self-supervised learning models or apply meta-learning algorithms.
Our result shows that HuBERT combined with a Matching network achieves the best result.
arXiv Detail & Related papers (2022-04-01T10:59:39Z) - Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z) - Representation Learning for Resource-Constrained Keyphrase Generation [78.02577815973764]
We introduce salient span recovery and salient span prediction as guided denoising language modeling objectives.
We show the effectiveness of the proposed approach for low-resource and zero-shot keyphrase generation.
arXiv Detail & Related papers (2022-03-15T17:48:04Z) - Meta-Learning with Variational Semantic Memory for Word Sense
Disambiguation [56.830395467247016]
We propose a model of semantic memory for WSD in a meta-learning setting.
Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork.
We show our model advances the state of the art in few-shot WSD and supports effective learning in extremely data-scarce scenarios.
arXiv Detail & Related papers (2021-06-05T20:40:01Z) - MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z) - Few-Shot Keyword Spotting With Prototypical Networks [3.6930948691311016]
Keyword spotting has been widely used in many voice interfaces such as Amazon's Alexa and Google Home.
We first formulate this problem as a few-shot keyword spotting and approach it using metric learning.
We then propose a solution to the prototypical few-shot keyword spotting problem using temporal and dilated convolutions in a prototypical network.
arXiv Detail & Related papers (2020-07-25T20:17:56Z) - Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM
Networks [3.8382752162527933]
In this paper, we focus on an open-vocabulary keyword spotting method, allowing the user to define their own keywords without having to retrain the whole model.
We describe the different design choices leading to a fast and small-footprint system, able to run on tiny devices, for any arbitrary set of user-defined keywords.
arXiv Detail & Related papers (2020-02-25T13:27:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.