Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors
- URL: http://arxiv.org/abs/2408.12481v1
- Date: Thu, 22 Aug 2024 15:17:02 GMT
- Title: Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors
- Authors: Manuele Rusci, Francesco Paci, Marco Fariselli, Eric Flamand, Tinne Tuytelaars
- Abstract summary: This paper proposes a self-learning framework to incrementally train a personalized Keyword Spotting (KWS) model after deployment on ultra-low-power smart audio sensors.
We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to newly recorded audio frames based on a similarity score with respect to a few user recordings.
Our empirical results pave the way to self-adaptive personalized KWS sensors at the extreme edge.
- Score: 27.684160259995174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a self-learning framework to incrementally train (fine-tune) a personalized Keyword Spotting (KWS) model after deployment on ultra-low-power smart audio sensors. We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to newly recorded audio frames based on a similarity score with respect to a few user recordings. By experimenting with multiple KWS models with up to 0.5M parameters on two public datasets, we show accuracy improvements of up to +19.2% and +16.0% over the initial models pretrained on a large set of generic keywords. The labeling task is demonstrated on a sensor system composed of a low-power microphone and an energy-efficient microcontroller (MCU). By efficiently exploiting the heterogeneous processing engines of the MCU, the always-on labeling task runs in real time at an average power cost of up to 8.2 mW. On the same platform, we estimate the energy cost of on-device training to be 10x lower than the labeling energy when a new utterance is sampled every 5 s (DS-CNN-S) or 16.4 s (DS-CNN-M). Our empirical results pave the way to self-adaptive personalized KWS sensors at the extreme edge.
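As a rough illustration of the labeling step described in the abstract, the sketch below assigns a pseudo-label to a new frame by cosine similarity against embeddings of a few user enrollment recordings. The function names, the cosine metric, and the 0.7 threshold are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def pseudo_label(frame_emb, enroll_embs, enroll_labels, threshold=0.7):
    """Assign a keyword pseudo-label to a newly recorded frame embedding.

    frame_emb: (D,) embedding of the new audio frame.
    enroll_embs: (K, D) embeddings of the few user enrollment recordings.
    enroll_labels: length-K list of keyword labels, one per enrollment.
    Returns a label, or None when no enrollment is similar enough.
    """
    # Cosine similarity between the frame and every enrollment utterance.
    a = frame_emb / np.linalg.norm(frame_emb)
    b = enroll_embs / np.linalg.norm(enroll_embs, axis=1, keepdims=True)
    sims = b @ a
    best = int(np.argmax(sims))
    # Low-similarity frames are discarded rather than used for fine-tuning.
    return enroll_labels[best] if sims[best] >= threshold else None
```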
Related papers
- MiniCPM4: Ultra-Efficient LLMs on End Devices [124.73631357883228]
MiniCPM4 is a highly efficient large language model (LLM) designed explicitly for end-side devices.
We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively.
arXiv Detail & Related papers (2025-06-09T16:16:50Z)
- CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention [53.539020807256904]
We introduce a Compact Encoder for Representations of Brain Oscillations using alternating attention (CEReBrO).
Our tokenization scheme represents EEG signals as per-channel patches.
We propose an alternating attention mechanism that jointly models intra-channel temporal dynamics and inter-channel spatial correlations, achieving a 2x speed improvement with 6x less memory than standard self-attention.
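The alternating pattern can be factored as two cheap attention passes instead of one quadratic pass over all channel-time tokens. A minimal PyTorch sketch under an assumed (batch, channel, time, dim) token layout follows; it is a sketch of the idea, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    """Alternate temporal (within-channel) and spatial (across-channel)
    self-attention over per-channel patch tokens of shape (B, C, T, D)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, d = x.shape
        # Temporal pass: attend over T within each channel (cost ~ C * T^2).
        xt = x.reshape(b * c, t, d)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        # Spatial pass: attend over C at each time step (cost ~ T * C^2).
        xs = xt.reshape(b, c, t, d).permute(0, 2, 1, 3).reshape(b * t, c, d)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        return xs.reshape(b, t, c, d).permute(0, 2, 1, 3)
```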
arXiv Detail & Related papers (2025-01-18T21:44:38Z)
- TSAK: Two-Stage Semantic-Aware Knowledge Distillation for Efficient Wearable Modality and Model Optimization in Manufacturing Lines [4.503003860563811]
We present a two-stage semantic-aware knowledge distillation approach, TSAK, for efficient, privacy-aware, and wearable HAR in manufacturing lines.
Compared to the larger teacher model, the student model takes fewer sensor channels from a single hand, has 79% fewer parameters, runs 8.88 times faster, and requires 96.6% less compute (FLOPs).
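The summary gives results rather than the loss; for orientation, the block below is the generic teacher-student distillation objective such an approach typically builds on (Hinton-style KD with temperature T), not TSAK's actual two-stage semantic-aware formulation.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-target KL from the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-label term
    return alpha * hard + (1.0 - alpha) * soft
```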
arXiv Detail & Related papers (2024-08-26T09:44:21Z)
- Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models [84.8919069953397]
Self-TAught Recognizer (STAR) is an unsupervised adaptation framework for speech recognition systems.
We show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains.
STAR is highly data-efficient, requiring less than one hour of unlabeled data.
arXiv Detail & Related papers (2024-05-23T04:27:11Z)
- AutoMix: Automatically Mixing Language Models [62.51238143437967]
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations.
We present AutoMix, an approach that strategically routes queries to larger LMs based on the approximate correctness of outputs from a smaller LM.
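Stripped of AutoMix's verifier details, the routing logic reduces to a threshold test. The sketch below assumes hypothetical small_lm, large_lm, and verify callables; it is not the paper's actual implementation.

```python
def automix_route(query, small_lm, large_lm, verify, threshold=0.5):
    """Answer with the small model first; escalate to the large model only
    when a (noisy) self-verification score deems the draft unreliable.

    small_lm / large_lm return answer strings; verify(query, answer) returns
    a score in [0, 1] approximating the draft's probability of correctness.
    """
    draft = small_lm(query)
    confidence = verify(query, draft)   # e.g. few-shot self-verification
    if confidence >= threshold:
        return draft                    # cheap path: keep the small-LM answer
    return large_lm(query)              # expensive path: escalate
```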
arXiv Detail & Related papers (2023-10-19T17:57:39Z)
- Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers [98.30298332661323]
This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5.
We develop a new model, METRO-T0, which is pretrained using the redesigned ELECTRA-Style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks.
Our analysis of the model's neural activations and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from a more balanced contribution of parameters and better utilization of their capacity.
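For context, ELECTRA-style pretraining turns model-generated corruptions into a per-token classification signal. A minimal sketch of that replaced-token-detection loss follows; METRO-style training adds further objectives not shown here.

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, input_ids, original_ids):
    """Replaced-token-detection loss (ELECTRA-style).

    disc_logits:  (B, L) per-token scores from the discriminator.
    input_ids:    (B, L) tokens after a small generator filled masked slots.
    original_ids: (B, L) uncorrupted tokens.
    """
    # Binary target: 1 where the generator's sample differs from the original.
    is_replaced = (input_ids != original_ids).float()
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
```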
arXiv Detail & Related papers (2023-05-21T21:06:23Z)
- HyperMODEST: Self-Supervised 3D Object Detection with Confidence Score Filtering [9.14477900515147]
Current LiDAR-based 3D object detectors for autonomous driving are almost entirely trained on human-annotated data.
MODEST is the first work to train 3D object detectors without any labels.
We propose a universal method that can largely accelerate the self-training process and does not require tuning on a specific dataset.
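A common building block of such self-training pipelines is confidence filtering of pseudo-labels between rounds. The sketch below keeps a fixed fraction of the highest-scoring boxes; keep_frac is a hypothetical knob, not the paper's tuned setting.

```python
import numpy as np

def filter_pseudo_labels(boxes, scores, keep_frac=0.8):
    """Keep only the most confident pseudo-labels for the next self-training
    round; low-scoring detections are dropped rather than trained on.

    boxes: (N, 7) array of pseudo-labeled 3D boxes; scores: (N,) confidences.
    """
    if len(scores) == 0:
        return boxes
    cutoff = np.quantile(scores, 1.0 - keep_frac)
    return boxes[scores >= cutoff]
```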
arXiv Detail & Related papers (2023-04-27T18:12:11Z)
- Guided Hybrid Quantization for Object detection in Multimodal Remote Sensing Imagery via One-to-one Self-teaching [35.316067181895264]
We propose a Guided Hybrid Quantization with One-to-one Self-Teaching (GHOST) framework.
First, we design a structure called guided quantization self-distillation (GQSD), an innovative idea for realizing a lightweight model through the synergy of quantization and distillation.
In addition, to improve information transfer, we propose a one-to-one self-teaching (OST) module that gives the student network the ability of self-judgment.
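As a rough sketch of the quantization/distillation synergy (not GHOST's actual GQSD or OST modules), the block below fake-quantizes weights with a straight-through estimator and distills the quantized student from its full-precision self-teacher.

```python
import torch
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake-quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # forward: quantized; backward: identity

def quant_distill_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Task loss on the quantized student plus distillation from the
    full-precision self-teacher (the quantization/distillation synergy)."""
    task = F.cross_entropy(student_logits, labels)
    distill = F.mse_loss(student_logits, teacher_logits.detach())
    return alpha * task + (1.0 - alpha) * distill
```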
arXiv Detail & Related papers (2022-12-31T06:14:59Z)
- Prompt Tuning for Parameter-efficient Medical Image Segmentation [79.09285179181225]
We propose and investigate several contributions to achieve a parameter-efficient but effective adaptation for semantic segmentation on two medical imaging datasets.
We pre-train this architecture with a dedicated dense self-supervision scheme based on assignments to online generated prototypes.
We demonstrate that the resulting neural network model narrows the gap between fully fine-tuned and parameter-efficiently adapted models.
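The generic prompt-tuning pattern behind such parameter-efficient adaptation is compact: freeze the pretrained backbone and learn only a few prompt tokens. The sketch below assumes a token-sequence encoder and is not the paper's exact dense-prediction variant.

```python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Parameter-efficient adaptation: freeze the pretrained encoder and
    learn only a small set of prompt tokens prepended to the input."""
    def __init__(self, encoder: nn.Module, dim: int, n_prompts: int = 8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False        # backbone stays frozen
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, length, dim); only self.prompts receives gradients.
        b = tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.encoder(torch.cat([prompts, tokens], dim=1))
```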
arXiv Detail & Related papers (2022-11-16T21:55:05Z)
- Tuning arrays with rays: Physics-informed tuning of quantum dot charge states [0.0]
Quantum computers based on gate-defined quantum dots (QDs) are expected to scale.
As the number of qubits increases, the burden of manually calibrating these systems becomes unreasonable.
Here, we demonstrate an intuitive, reliable, and data-efficient set of tools for automated global state and charge tuning.
arXiv Detail & Related papers (2022-09-08T14:17:49Z)
- Toward smart composites: small-scale, untethered prediction and control for soft sensor/actuator systems [0.6465251961564604]
We present a suite of algorithms and tools for model-predictive control of sensor/actuator systems with embedded microcontroller units (MCUs).
These MCUs can be colocated with sensors and actuators, enabling a new class of smart composites capable of autonomous behavior.
Online Newton-Raphson optimization solves for the control input.
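A scalar Newton-Raphson update of the control input fits in a few lines, which is what makes it attractive for an MCU control loop. The cost_grad and cost_hess callables below are hypothetical names for this sketch.

```python
def newton_control(cost_grad, cost_hess, u0, iters=5, tol=1e-6):
    """Solve d(cost)/du = 0 for a scalar control input u via Newton-Raphson.

    cost_grad / cost_hess: callables giving the cost's first and second
    derivative at u. Returns the (locally) optimal control input.
    """
    u = u0
    for _ in range(iters):
        g, h = cost_grad(u), cost_hess(u)
        if abs(h) < 1e-12:       # avoid division by near-zero curvature
            break
        step = g / h
        u -= step
        if abs(step) < tol:      # converged
            break
    return u

# Example: minimize (u - 2)^2 + 0.1 * u^4.
u_star = newton_control(lambda u: 2 * (u - 2) + 0.4 * u**3,
                        lambda u: 2 + 1.2 * u**2, u0=0.0)
```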
arXiv Detail & Related papers (2022-05-22T22:19:09Z)
- Sub-mW Keyword Spotting on an MCU: Analog Binary Feature Extraction and Binary Neural Networks [19.40893986868577]
Keyword spotting (KWS) is a crucial function enabling interaction with the many ubiquitous smart devices in our surroundings.
This work addresses KWS energy efficiency on low-cost microcontroller units (MCUs).
By replacing the digital preprocessing with the proposed analog front-end, we show that the energy required for data acquisition and preprocessing can be reduced by 29x.
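On the neural-network side, binary weights and activations reduce each dot product to XNOR-and-popcount on an MCU. The sketch below models that arithmetic with +/-1 values and a per-layer scale; it illustrates BNN inference in general, not the paper's specific network.

```python
import numpy as np

def binarize(x):
    """Sign binarization used by BNNs (zero mapped to +1)."""
    return np.where(x >= 0, 1.0, -1.0)

def binary_dense(x_bin, w_bin, alpha):
    """One binary layer: with inputs and weights in {-1, +1}, the dot
    product is what an MCU would compute via XNOR + popcount; alpha is a
    per-layer scaling factor recovering the output's dynamic range."""
    return alpha * (x_bin @ w_bin)

# Minimal usage: binarized features from a front-end -> one binary layer.
rng = np.random.default_rng(0)
x = binarize(rng.standard_normal(64))          # binary feature vector
w = binarize(rng.standard_normal((64, 10)))    # binary weight matrix
logits = binary_dense(x, w, alpha=0.05)
```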
arXiv Detail & Related papers (2022-01-10T15:10:58Z)
- Noise-resistant Deep Metric Learning with Ranking-based Instance Selection [59.286567680389766]
We propose a noise-resistant training technique for DML, which we name Probabilistic Ranking-based Instance Selection with Memory (PRISM).
PRISM identifies noisy data in a minibatch using average similarity against image features extracted from several previous versions of the neural network.
To alleviate the high computational cost brought by the memory bank, we introduce an acceleration method that replaces individual data points with the class centers.
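The class-center variant can be sketched compactly: a sample counts as clean when its embedding is similar to its own class center. The noise_quantile knob below is a hypothetical stand-in, not PRISM's tuned setting.

```python
import torch
import torch.nn.functional as F

def clean_mask(feats, labels, centers, noise_quantile=0.3):
    """Flag likely-clean samples in a minibatch.

    feats: (B, D) embeddings; labels: (B,) class ids; centers: (C, D)
    running class centers (the memory-saving stand-in for a feature bank).
    Returns a boolean mask; True = treat the sample as clean.
    """
    sims = F.cosine_similarity(feats, centers[labels], dim=1)
    cutoff = torch.quantile(sims, noise_quantile)  # drop the least similar
    return sims >= cutoff
```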
arXiv Detail & Related papers (2021-03-30T03:22:17Z)
- Self-Supervised Person Detection in 2D Range Data using a Calibrated Camera [83.31666463259849]
We propose a method to automatically generate training labels (called pseudo-labels) for 2D LiDAR-based person detectors.
We show that self-supervised detectors, trained or fine-tuned with pseudo-labels, outperform detectors trained using manual annotations.
Our method is an effective way to improve person detectors during deployment without any additional labeling effort.
arXiv Detail & Related papers (2020-12-16T12:10:04Z)
- Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution [5.672132510411465]
Keyword Spotting (KWS) plays a vital role in human-computer interaction for smart on-device terminals and service robots.
It remains challenging to achieve a good trade-off between small footprint and high accuracy for the KWS task.
We propose a multi-branch temporal convolution module (MTConv), a CNN block consisting of multiple temporal convolution filters with different kernel sizes, which enriches temporal feature space.
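A minimal PyTorch sketch of such a multi-branch block follows; the kernel sizes are illustrative rather than the paper's exact choices.

```python
import torch
import torch.nn as nn

class MTConv(nn.Module):
    """Multi-branch temporal convolution: parallel Conv1d filters with
    different kernel sizes, summed to enrich the temporal feature space."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); odd kernels keep the time length.
        return torch.stack([b(x) for b in self.branches], dim=0).sum(dim=0)
```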
arXiv Detail & Related papers (2020-10-20T02:07:07Z)
- Simplified Self-Attention for Transformer-based End-to-End Speech Recognition [56.818507476125895]
We propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors.
We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks.
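The FSMN memory block is essentially a depthwise temporal FIR filter over nearby frames with a residual connection. The sketch below shows that block in isolation, as one way it could stand in for the query/key projections; it is an assumption-laden simplification, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class FSMNMemory(nn.Module):
    """FSMN-style memory block: a learnable depthwise FIR filter over a
    window of nearby frames, added back to the current frame."""
    def __init__(self, dim: int, context: int = 5):
        super().__init__()
        self.fir = nn.Conv1d(dim, dim, kernel_size=2 * context + 1,
                             padding=context, groups=dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); the residual keeps the current frame dominant.
        return x + self.fir(x.transpose(1, 2)).transpose(1, 2)
```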
arXiv Detail & Related papers (2020-05-21T04:55:59Z)
- AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction [75.16836697734995]
We propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS).
AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence.
AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service.
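The structural idea can be sketched as a learnable gate on each pairwise interaction of a factorization model; AutoFIS's actual search stage trains these gates with a sparsity-inducing optimizer before retraining, which is omitted here.

```python
import torch
import torch.nn as nn

class GatedInteractions(nn.Module):
    """FM-style pairwise interactions, each scaled by a learnable gate.
    After search, interactions whose gate is driven to zero are pruned."""
    def __init__(self, num_fields: int):
        super().__init__()
        n_pairs = num_fields * (num_fields - 1) // 2
        self.gates = nn.Parameter(torch.ones(n_pairs))
        self.register_buffer(
            "pairs", torch.triu_indices(num_fields, num_fields, offset=1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, fields, dim); inner product per field pair.
        inter = (emb[:, self.pairs[0]] * emb[:, self.pairs[1]]).sum(-1)
        return (self.gates * inter).sum(-1)  # gated sum -> (batch,)
```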
arXiv Detail & Related papers (2020-03-25T06:53:54Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
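For the attention-distribution term of such distillation (MiniLM also transfers value relations, which are not shown), a sketch: KL divergence between teacher and student last-layer attention maps.

```python
import torch

def attention_kd_loss(student_attn, teacher_attn):
    """Distill the teacher's last-layer self-attention distributions.

    student_attn / teacher_attn: (B, H, L, L) post-softmax attention maps.
    Returns the KL divergence, averaged over batch, heads, and queries.
    """
    kl = teacher_attn * (teacher_attn.clamp_min(1e-12).log()
                         - student_attn.clamp_min(1e-12).log())
    return kl.sum(-1).mean()  # sum over keys; mean over the rest
```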
arXiv Detail & Related papers (2020-02-25T15:21:10Z)