Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution
- URL: http://arxiv.org/abs/2010.09960v1
- Date: Tue, 20 Oct 2020 02:07:07 GMT
- Title: Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution
- Authors: Ximin Li, Xiaodong Wei, Xiaowei Qin
- Abstract summary: Keyword Spotting (KWS) plays a vital role in human-computer interaction for smart on-device terminals and service robots.
It remains challenging to achieve a good trade-off between small footprint and high accuracy for the KWS task.
We propose a multi-branch temporal convolution module (MTConv), a CNN block consisting of multiple temporal convolution filters with different kernel sizes, which enriches temporal feature space.
- Score: 5.672132510411465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyword Spotting (KWS) plays a vital role in human-computer interaction for
smart on-device terminals and service robots. It remains challenging to
achieve a good trade-off between small footprint and high accuracy for the KWS
task. In this
paper, we explore the application of multi-scale temporal modeling to the
small-footprint keyword spotting task. We propose a multi-branch temporal
convolution module (MTConv), a CNN block consisting of multiple temporal
convolution filters with different kernel sizes, which enriches temporal
feature space. Besides, taking advantage of temporal and depthwise
convolution, a temporal efficient neural network (TENet) is designed for the
KWS system. Based on the proposed model, we replace standard temporal
convolution layers with MTConvs, which can be trained for better performance.
At the inference stage, the MTConv can be equivalently converted to the base
convolution architecture, so that no extra parameters or computational costs
are added compared to the base model. Results on the Google Speech Commands
dataset show that one of our models trained with MTConv achieves 96.8%
accuracy with only 100K parameters.
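The equivalent conversion mentioned above is a structural-reparameterization argument: because convolution is linear, parallel temporal convolutions can be summed into a single kernel after training. Below is a minimal PyTorch sketch of this idea, assuming depthwise temporal convolutions with kernel sizes (3, 5, 7, 9); the class name, branch layout, and omission of per-branch batch normalization are illustrative choices, not the authors' code.

```python
import torch
import torch.nn as nn

class MTConv(nn.Module):
    """Sketch of a multi-branch temporal convolution block.

    Training: parallel depthwise 1D convolutions with different kernel
    sizes, summed. Inference: fuse() collapses the branches into one
    equivalent convolution by zero-padding each kernel to the largest
    size (centered) and summing.
    """

    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        self.max_k = max(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2,
                      groups=channels)
            for k in kernel_sizes
        )

    def forward(self, x):  # x: (batch, channels, time)
        return sum(branch(x) for branch in self.branches)

    @torch.no_grad()
    def fuse(self):
        """Return a single Conv1d equivalent to the summed branches."""
        channels = self.branches[0].in_channels
        fused = nn.Conv1d(channels, channels, self.max_k,
                          padding=self.max_k // 2, groups=channels)
        fused.weight.zero_()
        fused.bias.zero_()
        for branch in self.branches:
            k = branch.kernel_size[0]
            pad = (self.max_k - k) // 2  # center the smaller kernel
            fused.weight[:, :, pad:pad + k] += branch.weight
            fused.bias.add_(branch.bias)
        return fused
```

Since the fused kernel reproduces the multi-branch output exactly, inference pays only the cost of the base model.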
Related papers
- Test-Time Training Done Right [61.8429380523577]
Test-Time Training (TTT) models context by adapting part of the model's weights (referred to as fast weights) during inference.
Existing TTT methods have struggled to show effectiveness in handling long-context data.
We develop Large Chunk Test-Time Training (LaCT), which improves hardware utilization by orders of magnitude.
arXiv Detail & Related papers (2025-05-29T17:50:34Z)
- MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets [3.8601741392210434]
Vision Transformer (ViT) has demonstrated significant potential in various vision tasks due to its strong ability to model long-range dependencies.
We present a small-size ViT architecture with multi-scale self-attention mechanism and convolution blocks to model different scales of attention.
Our model achieves an accuracy of 84.68% on CIFAR-100 with 14.0M parameters and 2.5 GFLOPs, without pre-training on large datasets.
arXiv Detail & Related papers (2025-01-10T15:18:05Z)
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications.
Our model achieves 83.0%/84.1% top-1 with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
- RepCNN: Micro-sized, Mighty Models for Wakeword Detection [3.4888176891918654]
Always-on machine learning models require a very low memory and compute footprint.
We show that a small convolutional model can be better trained by first refactoring its computation into a larger multi-branched architecture.
We show that our always-on wake-word detector model, RepCNN, provides a good trade-off between latency and accuracy during inference.
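RepCNN relies on the same train-large, infer-small reparameterization pattern as MTConv above. Here is a quick self-contained check of the underlying fusion identity; the channel counts and kernel sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Summing two "same"-padded depthwise convolutions (kernel sizes 3 and 7)
# equals one convolution whose kernel is the size-3 kernel zero-padded to
# size 7 (centered) plus the size-7 kernel.
x = torch.randn(1, 8, 50)        # (batch, channels, time)
w3 = torch.randn(8, 1, 3)        # depthwise kernel, size 3
w7 = torch.randn(8, 1, 7)        # depthwise kernel, size 7

y_branches = (F.conv1d(x, w3, padding=1, groups=8)
              + F.conv1d(x, w7, padding=3, groups=8))

w_fused = w7.clone()
w_fused[:, :, 2:5] += w3         # center the 3-tap kernel inside 7 taps
y_fused = F.conv1d(x, w_fused, padding=3, groups=8)

print(torch.allclose(y_branches, y_fused, atol=1e-5))  # True
```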
arXiv Detail & Related papers (2024-06-04T16:14:19Z)
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We evaluate on three practical sports game datasets: Basketball-U, Football-U, and Soccer-U.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing margins.
arXiv Detail & Related papers (2023-09-14T17:58:33Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- Prompt Tuning for Parameter-efficient Medical Image Segmentation [79.09285179181225]
We propose and investigate several contributions to achieve a parameter-efficient but effective adaptation for semantic segmentation on two medical imaging datasets.
We pre-train this architecture with a dedicated dense self-supervision scheme based on assignments to online generated prototypes.
We demonstrate that the resulting neural network model is able to attenuate the gap between fully fine-tuned and parameter-efficiently adapted models.
arXiv Detail & Related papers (2022-11-16T21:55:05Z)
- TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
arXiv Detail & Related papers (2022-03-17T05:49:35Z)
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z)
- SITHCon: A neural network robust to variations in input scaling on the time dimension [0.0]
In machine learning, convolutional neural networks (CNNs) have been extremely influential in both computer vision and in recognizing patterns extended over time.
This paper introduces a Scale-Invariant Temporal History Convolution network (SITHCon) that uses a logarithmically-distributed temporal memory.
arXiv Detail & Related papers (2021-07-09T18:11:50Z)
- Broadcasted Residual Learning for Efficient Keyword Spotting [7.335747584353902]
We present a broadcasted residual learning method to achieve high accuracy with small model size and computational load.
We also propose a novel network architecture, Broadcasting-residual network (BC-ResNet), based on broadcasted residual learning.
BC-ResNets achieve state-of-the-art 98.0% and 98.7% top-1 accuracy on Google speech command datasets v1 and v2, respectively.
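A minimal sketch of the broadcasted-residual idea as the abstract describes it: average the 2D (frequency x time) feature map over frequency, model it with a cheap 1D temporal convolution, then broadcast the result back over frequency as a residual. The real BC-ResNet block adds normalization, activations, and other details omitted here; all names and sizes below are assumptions.

```python
import torch.nn as nn

class BroadcastedResidual(nn.Module):
    """Sketch of broadcasted residual learning (BC-ResNet idea)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2,
                                  groups=channels)

    def forward(self, x):          # x: (batch, channels, freq, time)
        pooled = x.mean(dim=2)     # squeeze out frequency -> (B, C, T)
        y = self.temporal(pooled)  # temporal modeling on 1D features
        return x + y.unsqueeze(2)  # broadcast over frequency, add back
```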
arXiv Detail & Related papers (2021-06-08T06:55:39Z)
- STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid network (STH) which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.