Online Continual Learning in Keyword Spotting for Low-Resource Devices
via Pooling High-Order Temporal Statistics
- URL: http://arxiv.org/abs/2307.12660v1
- Date: Mon, 24 Jul 2023 10:04:27 GMT
- Title: Online Continual Learning in Keyword Spotting for Low-Resource Devices
via Pooling High-Order Temporal Statistics
- Authors: Umberto Michieli, Pablo Peso Parada, Mete Ozay
- Abstract summary: Keyword Spotting (KWS) models on embedded devices should adapt quickly to new user-defined words without forgetting previous ones.
We consider the setup of embedded online continual learning (EOCL), where KWS models with a frozen backbone are trained to incrementally recognize new words from a non-repeated stream of samples.
We propose Temporal Aware Pooling (TAP), which constructs an enriched feature space by computing high-order moments of speech features extracted by a pre-trained backbone.
- Score: 22.129910930772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyword Spotting (KWS) models on embedded devices should adapt quickly to new user-defined words without forgetting previous ones. Embedded devices have limited storage and computational resources, so they can neither save samples nor update large models. We consider the setup of embedded online continual learning (EOCL), where KWS models with a frozen backbone are trained to incrementally recognize new words from a non-repeated stream of samples, seen one at a time. To this end, we propose Temporal Aware Pooling (TAP), which constructs an enriched feature space by computing high-order moments of the speech features extracted by a pre-trained backbone. Our method, TAP-SLDA, updates a Gaussian model for each class on the enriched feature space to effectively use audio representations. In experimental analyses, TAP-SLDA outperforms competitors on several setups, backbones, and baselines, bringing a relative average gain of 11.3% on the GSC dataset.
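To make the method concrete, here is a minimal sketch of the two ingredients, assuming a frozen backbone that emits a (T, D) feature sequence per utterance; the moment order, the shrinkage value, and the simplified streaming covariance update are illustrative choices, not the authors' released implementation:

```python
import numpy as np

def temporal_aware_pooling(feats: np.ndarray, max_order: int = 5) -> np.ndarray:
    """Pool a (T, D) feature sequence into one vector by concatenating the
    temporal mean with standardized central moments up to `max_order`."""
    mean = feats.mean(axis=0)
    centered = feats - mean
    std = centered.std(axis=0) + 1e-8            # guard against silent channels
    moments = [mean]
    for r in range(2, max_order + 1):            # variance, skewness, kurtosis, ...
        moments.append(((centered / std) ** r).mean(axis=0))
    return np.concatenate(moments)               # shape: (max_order * D,)

class StreamingLDA:
    """Per-class running means plus one shared streaming covariance over the
    enriched (pooled) feature space, updated one sample at a time, with no
    stored samples and no backbone updates."""

    def __init__(self, dim: int, num_classes: int, shrinkage: float = 1e-2):
        self.means = np.zeros((num_classes, dim))
        self.counts = np.zeros(num_classes)
        self.cov = np.zeros((dim, dim))
        self.n = 0
        self.shrinkage = shrinkage

    def fit_one(self, x: np.ndarray, y: int) -> None:
        delta = x - self.means[y]
        self.n += 1
        self.cov += (np.outer(delta, delta) - self.cov) / self.n
        self.counts[y] += 1
        self.means[y] += delta / self.counts[y]  # running mean of class y

    def predict(self, x: np.ndarray) -> int:
        prec = np.linalg.inv(self.cov + self.shrinkage * np.eye(len(self.cov)))
        w = self.means @ prec                    # linear discriminant weights
        b = -0.5 * np.einsum("cd,cd->c", w, self.means)
        return int(np.argmax(w @ x + b))
```

Under this reading, adapting to a new word costs one pooled-vector update per sample, which is why the setup suits devices that cannot store data or fine-tune the backbone.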
Related papers
- Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification [14.58145497173618]
We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives, dynamically over time, and employs max pooling instead of averaging.
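From this summary alone, one plausible shape of the mechanism is sketched below; the single learned scoring vector `w` stands in for the paper's multi-perspective scorer and is an assumption, not the paper's exact formulation:

```python
import numpy as np

def layer_attentive_pooling(layer_feats: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Aggregate per-layer features into one utterance vector.

    layer_feats: (L, T, D) hidden states from L layers of a pre-trained
    speech model; w: (D,) learnable scoring vector (illustrative stand-in).
    Attention over layers is computed per frame, then frames are
    max-pooled instead of averaged."""
    scores = layer_feats @ w                       # (L, T) per-frame layer scores
    alpha = np.exp(scores - scores.max(axis=0))    # softmax over layers, per frame
    alpha /= alpha.sum(axis=0)
    fused = (alpha[..., None] * layer_feats).sum(axis=0)  # (T, D) fused frames
    return fused.max(axis=0)                       # max pooling over time -> (D,)
```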
arXiv Detail & Related papers (2025-12-15T07:39:56Z)
- Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset [1.497166779417398]
Keyword Spotting (KWS) is a privacy-aware intermediate task for brain-computer interfaces. We release an updated version of the pnpl library with word-level dataloaders and Colab-ready tutorials.
arXiv Detail & Related papers (2025-10-23T22:44:50Z)
- Holdout-Loss-Based Data Selection for LLM Finetuning via In-Context Learning [19.677969862434708]
We present a theoretically grounded, resource-efficient framework for data selection and reweighting. At its core is an In-Context Approximation (ICA) that estimates the holdout loss a model would incur after training on a candidate example. We derive per-example weights from ICA scores, dynamically reweighting gradient updates as model parameters evolve.
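Reading only this summary, a toy rendering of the idea might look as follows; the score definition and the softmax temperature are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def ica_score(holdout_loss_plain: float, holdout_loss_with_example: float) -> float:
    """In-context proxy for the effect of training on one candidate example:
    how much the holdout loss drops when the candidate is placed in the
    model's context (lower loss with the example => higher score)."""
    return holdout_loss_plain - holdout_loss_with_example

def example_weights(scores: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Turn per-example scores into gradient-update weights via a softmax;
    `temperature` is an illustrative knob, not from the paper."""
    z = scores / temperature
    z -= z.max()                  # numerical stability
    w = np.exp(z)
    return w / w.sum()
```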
arXiv Detail & Related papers (2025-10-16T09:00:39Z)
- Learning Robust Spatial Representations from Binaural Audio through Feature Distillation [64.36563387033921]
We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of speech without the need for data labels. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments.
arXiv Detail & Related papers (2025-08-28T15:43:15Z)
- PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing [48.30406812516552]
We introduce PLM, a Peripheral Language Model, developed through a co-design process that jointly optimizes the model architecture and edge system constraints.
PLM employs a Multi-head Latent Attention mechanism and the squared ReLU activation function to encourage sparsity, thereby reducing its peak memory footprint.
Evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data.
arXiv Detail & Related papers (2025-03-15T15:11:17Z)
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes [16.96483269023065]
Lifelong audio feature extraction involves learning new sound classes incrementally.
Optimizing the model only on new data can lead to catastrophic forgetting of previously learned tasks.
This paper introduces a new approach to continual audio representation learning called DeCoR.
arXiv Detail & Related papers (2023-05-29T02:25:03Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Knowledge Transfer For On-Device Speech Emotion Recognition with Neural Structured Learning [19.220263739291685]
Speech emotion recognition (SER) has been a popular research topic in human-computer interaction (HCI).
We propose a neural structured learning (NSL) framework through building synthesized graphs.
Our experiments demonstrate that training a lightweight SER model on the target dataset with speech samples and graphs can not only produce small SER models, but also enhance the model performance.
arXiv Detail & Related papers (2022-10-26T18:38:42Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex publicly available SLU dataset.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
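For context, the nearest-neighbor LM recipe such models build on interpolates a parametric next-token distribution with one read off an external datastore; the sketch below shows that standard formulation (not this paper's particular speed-up), with `k` and `lam` as illustrative hyperparameters:

```python
import numpy as np

def knn_lm_probs(lm_probs: np.ndarray, query: np.ndarray,
                 keys: np.ndarray, values: np.ndarray,
                 k: int = 8, lam: float = 0.25) -> np.ndarray:
    """Interpolate a parametric LM's next-token distribution with a
    distribution built from the k nearest datastore entries.
    keys: (N, D) cached hidden states; values: (N,) next-token ids."""
    d = np.linalg.norm(keys - query, axis=1)      # L2 distance to all keys
    nn = np.argsort(d)[:k]                        # indices of k nearest keys
    w = np.exp(-d[nn]); w /= w.sum()              # softmax over negative distances
    knn_probs = np.zeros_like(lm_probs)
    np.add.at(knn_probs, values[nn], w)           # scatter weights onto token ids
    return lam * knn_probs + (1.0 - lam) * lm_probs
```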
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Federated Action Recognition on Heterogeneous Embedded Devices [16.88104153104136]
In this work, we enable clients with limited computing power to perform action recognition, a computationally heavy task.
We first perform model compression at the central server through knowledge distillation on a large dataset.
The fine-tuning is required because the limited data in smaller datasets is not adequate for action recognition models to learn complex temporal features.
arXiv Detail & Related papers (2021-07-18T02:33:24Z)
- Contrastive Prototype Learning with Augmented Embeddings for Few-Shot Learning [58.2091760793799]
We propose a novel contrastive prototype learning with augmented embeddings (CPLAE) model.
With a class prototype as an anchor, CPL aims to pull the query samples of the same class closer and those of different classes further away.
Extensive experiments on several benchmarks demonstrate that our proposed CPLAE achieves new state-of-the-art.
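The pull/push behavior described above is a standard prototype-anchored contrastive objective; the sketch below shows that generic form, with cosine similarity and an illustrative `temperature` that are assumptions rather than details from the paper:

```python
import numpy as np

def prototype_contrastive_loss(query: np.ndarray, prototypes: np.ndarray,
                               label: int, temperature: float = 0.1) -> float:
    """Pull a query embedding toward its class prototype and push it away
    from the other class prototypes via softmax cross-entropy."""
    q = query / np.linalg.norm(query)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = p @ q / temperature                    # (C,) cosine similarities
    # Log-softmax with the true class prototype as the positive.
    log_probs = sims - sims.max() - np.log(np.exp(sims - sims.max()).sum())
    return float(-log_probs[label])
```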
arXiv Detail & Related papers (2021-01-23T13:22:44Z)
- Deep Learning based Segmentation of Fish in Noisy Forward Looking MBES Images [1.5469452301122177]
We build on recent advances in Deep Learning (DL) and Convolutional Neural Networks (CNNs) for semantic segmentation.
We demonstrate an end-to-end approach for a fish/non-fish probability prediction for all range-azimuth positions projected by an imaging sonar.
We show that our model achieves the desired performance and has learned to harness the importance of semantic context.
arXiv Detail & Related papers (2020-06-16T09:57:38Z)