Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models
- URL: http://arxiv.org/abs/2506.17686v1
- Date: Sat, 21 Jun 2025 11:39:11 GMT
- Title: Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models
- Authors: Alican Gok, Oguzhan Buyuksolak, Osman Erman Okman, Murat Saraclar
- Abstract summary: Keyword Spotting plays a critical role in enabling hands-free interaction for battery-powered edge devices. We propose a training scheme that leverages self-supervised learning models for robust feature extraction, dimensionality reduction, and knowledge distillation. We evaluate the proposed approach on the English portion of the Multilingual Spoken Words Corpus (MSWC) and the Google Speech Commands (GSC) datasets.
- Score: 3.25590215530292
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyword Spotting plays a critical role in enabling hands-free interaction for battery-powered edge devices. Few-Shot Keyword Spotting (FS-KWS) addresses the scalability and adaptability challenges of traditional systems by enabling recognition of custom keywords with only a few examples. However, existing FS-KWS systems achieve subpar accuracy at desirable false acceptance rates, particularly in resource-constrained edge environments. To address these issues, we propose a training scheme that leverages self-supervised learning models for robust feature extraction, dimensionality reduction, and knowledge distillation. The teacher model, based on Wav2Vec 2.0, is trained using the Sub-center ArcFace loss, which enhances inter-class separability and intra-class compactness. To enable efficient deployment on edge devices, we introduce attention-based dimensionality reduction and train a standard lightweight ResNet15 student model. We evaluate the proposed approach on the English portion of the Multilingual Spoken Words Corpus (MSWC) and the Google Speech Commands (GSC) datasets. Notably, the proposed training method improves the 10-shot classification accuracy from 33.4% to 74.1% on 11 classes at a 1% false alarm rate on the GSC dataset, making it significantly better suited for real-world use.
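The abstract's most specific design choice is the Sub-center ArcFace objective used to train the Wav2Vec 2.0 teacher. Below is a minimal PyTorch sketch of that loss as commonly formulated in the literature (Deng et al.); the sub-center count k, scale s, and margin m are illustrative hyperparameters, not values reported by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterArcFace(nn.Module):
    """Sub-center ArcFace head, minimal sketch.

    Each class holds k sub-centers; a class logit is the max cosine
    similarity over its sub-centers, with an additive angular margin m
    on the target class and a global scale s.
    """
    def __init__(self, embed_dim, num_classes, k=3, s=30.0, m=0.5):
        super().__init__()
        self.k, self.s, self.m = k, s, m
        self.weight = nn.Parameter(torch.empty(num_classes * k, embed_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and sub-centers.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        cos = cos.view(-1, cos.size(1) // self.k, self.k).max(dim=2).values
        # Apply the angular margin to the target-class angle only.
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, num_classes=theta.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)
```

In the paper's pipeline, `embeddings` would presumably be teacher features after the attention-based dimensionality reduction, with the resulting teacher then distilled into the ResNet15 student.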
Related papers
- Orthogonal Soft Pruning for Efficient Class Unlearning [26.76186024947296]
We propose a class-aware soft pruning framework to achieve rapid and precise forgetting with millisecond-level response times. Our method decorrelates convolutional filters and disentangles feature representations, while efficiently identifying class-specific channels.
arXiv Detail & Related papers (2025-06-24T09:52:04Z)
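The entry above names two mechanisms, class-specific channel identification and soft pruning, without further detail. As a hedged illustration only, the sketch below suppresses (rather than removes) the channels most specific to a forget class; the paper's actual scoring rule and orthogonal decorrelation step are not reconstructed here.

```python
import torch

@torch.no_grad()
def soft_prune_class_channels(conv, act_forget, act_retain, topk=8, decay=0.1):
    """Hedged sketch of class-aware soft pruning for unlearning.

    Channels whose mean activation is much higher on the forget class
    than on retained classes are down-scaled (soft pruning) rather than
    removed. `act_*` are (N, C, H, W) feature maps produced by `conv`.
    """
    score = act_forget.mean(dim=(0, 2, 3)) - act_retain.mean(dim=(0, 2, 3))
    idx = score.topk(topk).indices      # most forget-class-specific channels
    conv.weight[idx] *= decay           # softly suppress; no retraining needed
    if conv.bias is not None:
        conv.bias[idx] *= decay
```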
- Adaptive Noise Resilient Keyword Spotting Using One-Shot Learning [5.967661928760498]
Keyword spotting (KWS) is a key component of smart devices, enabling efficient and intuitive audio interaction. However, KWS systems often suffer performance degradation under real-world operating conditions. This study proposes a low-computation approach for continuous noise adaptation of pretrained neural networks used for KWS classification.
arXiv Detail & Related papers (2025-05-14T11:39:47Z)
- Few-shot Hate Speech Detection Based on the MindSpore Framework [2.6396343924017915]
We propose MS-Hate, a prompt-enhanced neural framework for few-shot hate speech detection implemented on the MindSpore deep learning platform. Experimental results on two benchmark datasets, HateXplain and HSOL, demonstrate that our approach outperforms competitive baselines in precision, recall, and F1-score. These findings highlight the potential of combining prompt-based learning with adversarial augmentation for robust and adaptable hate speech detection in few-shot scenarios.
arXiv Detail & Related papers (2025-04-22T15:42:33Z)
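The MS-Hate entry above relies on prompt-based learning for few-shot classification. A generic sketch of that idea follows, using the Hugging Face `fill-mask` pipeline rather than MindSpore; the template and verbalizer words are hypothetical, not those of the paper.

```python
from transformers import pipeline

# Template and verbalizer words below are hypothetical, not MS-Hate's own.
fill = pipeline("fill-mask", model="bert-base-uncased")

def classify(text):
    prompt = f"{text} Overall, this comment is [MASK]."
    # Score only the two verbalizer tokens for the masked position; each
    # target should ideally be a single token in the model's vocabulary.
    scores = {r["token_str"]: r["score"]
              for r in fill(prompt, targets=["offensive", "harmless"])}
    return max(scores, key=scores.get)

print(classify("You people do not belong here."))
```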
- How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario [72.02391485962127]
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). In low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. We extend a conventional efficient fine-tuning scheme based on the adapter to handle these issues.
arXiv Detail & Related papers (2024-11-27T10:51:00Z)
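The entry above extends "a conventional efficient fine-tuning scheme based on the adapter". The standard bottleneck adapter that phrase usually refers to looks like the following PyTorch sketch; the pretrained SSL model stays frozen and only these small modules are trained on the unseen language.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter (Houlsby-style): a small residual module inserted
    into each frozen SSL transformer layer; only adapters are trained."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Typical usage: freeze the pretrained SSL model, train adapters only.
# for p in ssl_model.parameters():
#     p.requires_grad = False
```

Zero-initializing the up-projection makes each adapter start as an identity, so inserting it does not perturb the pretrained model before fine-tuning begins.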
- Disentangled Training with Adversarial Examples For Robust Small-footprint Keyword Spotting [18.456711824241978]
We propose datasource-aware disentangled learning with adversarial examples to improve KWS robustness.
Experimental results demonstrate that the proposed learning strategy improves the false reject rate by 40.31% at a 1% false accept rate.
Our best-performing system achieves 98.06% accuracy on the Google Speech Commands V1 dataset.
arXiv Detail & Related papers (2024-08-23T20:03:51Z)
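The entry above trains with adversarial examples, but the summary does not say which attack generates them. FGSM is the simplest common choice; the sketch below applies it to raw waveforms as an illustration only.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, audio, label, epsilon=1e-3):
    """Craft an adversarial waveform with FGSM: a single gradient-sign
    step that maximizes the classification loss."""
    audio = audio.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(audio), label)
    loss.backward()
    # Perturb in the gradient's sign direction to increase the loss.
    return (audio + epsilon * audio.grad.sign()).detach()
```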
- Enhancing Visual Continual Learning with Language-Guided Supervision [76.38481740848434]
Continual learning aims to empower models to learn new tasks without forgetting previously acquired knowledge.
We argue that the scarce semantic information conveyed by one-hot labels hampers effective knowledge transfer across tasks.
Specifically, we use PLMs to generate semantic targets for each class, which are frozen and serve as supervision signals.
arXiv Detail & Related papers (2024-03-24T12:41:58Z)
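To make the "PLMs generate semantic targets for each class" idea in the entry above concrete, here is a hedged sketch: class-name embeddings from a frozen text encoder replace one-hot labels as regression targets. The model name, class names, and cosine-distance loss are illustrative choices, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Frozen PLM embeddings of the class names serve as semantic targets in
# place of one-hot labels; the model name below is an illustrative choice.
plm = SentenceTransformer("all-MiniLM-L6-v2")
CLASS_NAMES = ["golden retriever", "fire truck", "espresso"]
TARGETS = torch.tensor(plm.encode(CLASS_NAMES))   # (C, D), never updated

def language_guided_loss(features, labels, proj):
    """Pull projected visual features toward their class's PLM embedding."""
    pred = F.normalize(proj(features), dim=-1)
    tgt = F.normalize(TARGETS[labels], dim=-1)
    return 1.0 - (pred * tgt).sum(dim=-1).mean()  # mean cosine distance
```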
- Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
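The entry above fine-tunes only CLIP's attention pooling layer. With the OpenAI CLIP ResNet implementation, where that module is named `attnpool`, the parameter selection can be as simple as the sketch below; other codebases may use different module names.

```python
import clip

model, _ = clip.load("RN50")   # ResNet visual tower with an attnpool head
for name, p in model.named_parameters():
    p.requires_grad = "attnpool" in name   # train only attention pooling

trainable = [n for n, _ in model.named_parameters() if "attnpool" in n]
print(f"fine-tuning {len(trainable)} parameter tensors")
```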
- Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems [41.24728444810133]
This paper investigates few-shot learning methods for open-set KWS classification by combining a deep feature encoder with a prototype-based classifier.
With user-defined keywords from 10 classes of the Google Speech Command dataset, our study reports an accuracy of up to 76% in a 10-shot scenario.
arXiv Detail & Related papers (2023-06-03T17:10:33Z)
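The entry above combines a deep feature encoder with a prototype-based classifier for open-set KWS. A minimal sketch of that classifier: prototypes are the mean embeddings of the few enrollment shots, and queries not close enough to any prototype are rejected as unknown. The cosine metric and threshold here are illustrative choices.

```python
import torch
import torch.nn.functional as F

def open_set_predict(query, support, support_labels, num_classes, threshold):
    """Prototype classifier with open-set rejection, minimal sketch.

    Prototypes are mean embeddings of the few enrollment shots; queries
    farther than `threshold` from every prototype are labeled unknown.
    """
    protos = torch.stack([support[support_labels == c].mean(dim=0)
                          for c in range(num_classes)])
    sim = F.cosine_similarity(query.unsqueeze(1), protos.unsqueeze(0), dim=-1)
    best_sim, best_class = sim.max(dim=1)
    return torch.where(best_sim >= threshold, best_class,
                       torch.full_like(best_class, -1))  # -1 = unknown keyword
```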
- Contextual Squeeze-and-Excitation for Efficient Few-Shot Image Classification [57.36281142038042]
We present a new adaptive block called Contextual Squeeze-and-Excitation (CaSE) that adjusts a pretrained neural network on a new task to significantly improve performance.
We also present a new training protocol based on Coordinate-Descent called UpperCaSE that exploits meta-trained CaSE blocks and fine-tuning routines for efficient adaptation.
arXiv Detail & Related papers (2022-06-20T15:25:08Z)
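A hedged sketch of the CaSE idea named above: a squeeze-and-excitation block whose channel gates are computed once from a task's context (support) batch and then reused for queries. This reconstructs only the general mechanism; details of the published block may differ.

```python
import torch
import torch.nn as nn

class CaSE(nn.Module):
    """Contextual Squeeze-and-Excitation, hedged sketch: channel gates are
    derived once from the task's context images, then reused to modulate
    every forward pass for that task."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.gate = None                       # set per task via adapt()

    def adapt(self, context):                  # context: (N, C, H, W)
        pooled = context.mean(dim=(0, 2, 3))   # squeeze over the whole task
        self.gate = self.mlp(pooled)           # (C,) excitation weights

    def forward(self, x):                      # x: (B, C, H, W)
        if self.gate is None:                  # unadapted: act as identity
            return x
        return x * self.gate.view(1, -1, 1, 1)
```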
- A Method to Reveal Speaker Identity in Distributed ASR Training, and How to Counter It [3.18475216176047]
We design the first method for revealing the identity of the speaker of a training utterance with access only to a gradient.
We show that it is possible to reveal the speaker's identity with 34% top-1 accuracy (51% top-5 accuracy) on the LibriSpeech dataset.
arXiv Detail & Related papers (2021-04-15T23:15:12Z)
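The summary above says only that the speaker is revealed "with access only to a gradient". One generic way such an attack can work is gradient matching against candidate speakers, sketched below; this is a toy illustration, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def rank_speakers(observed_grad, model, loss_fn, candidates):
    """Toy gradient-matching attack: score each candidate speaker by how
    well a gradient computed from their utterance aligns with the observed
    training gradient. Illustrative only; the paper's method may differ."""
    scores = {}
    for speaker, (audio, transcript) in candidates.items():
        model.zero_grad()
        loss_fn(model(audio), transcript).backward()
        grad = torch.cat([p.grad.flatten() for p in model.parameters()
                          if p.grad is not None])
        scores[speaker] = F.cosine_similarity(grad, observed_grad, dim=0).item()
    return sorted(scores, key=scores.get, reverse=True)   # best match first
```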
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
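BPE-dropout (Provilkov et al.), the technique the entry above builds on, regularizes subword segmentation by randomly skipping merges during encoding, so the model sees varied acoustic-unit sequences for the same word. A toy, self-contained sketch:

```python
import random

def bpe_dropout_encode(word, merges, p=0.1):
    """Toy BPE-dropout: each eligible merge is skipped with probability p,
    yielding stochastic subword segmentations of the same word."""
    symbols = list(word)
    for left, right in merges:                 # merges in priority order
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (left, right) and random.random() >= p:
                symbols[i:i + 2] = [left + right]   # apply the merge
            else:
                i += 1                         # skipped merges stay split
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_dropout_encode("lower", merges))     # e.g. ['low', 'er'] or ['l', 'o', 'w', 'er']
```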
- Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
The Prototype-centered Attentive Learning (PAL) model is composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates an attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
arXiv Detail & Related papers (2021-01-20T11:48:12Z)
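A hedged sketch of PAL's first component, the prototype-centered contrastive loss: prototypes act as anchors over the query set, the reverse of the usual query-centered objective. The temperature and the positive/negative construction below are assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_centered_loss(protos, queries, query_labels, tau=0.1):
    """Prototype-centered contrastive loss, hedged sketch: each class
    prototype is an anchor; its own class's queries are positives and
    all other queries are negatives."""
    sim = protos @ queries.t() / tau                          # (C, Q)
    log_p = sim.log_softmax(dim=1)                            # over queries
    mask = query_labels.unsqueeze(0) == torch.arange(
        protos.size(0), device=protos.device).unsqueeze(1)    # (C, Q) positives
    return -(log_p * mask).sum() / mask.sum()
```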