Exploring Representation Learning for Small-Footprint Keyword Spotting
- URL: http://arxiv.org/abs/2303.10912v1
- Date: Mon, 20 Mar 2023 07:09:26 GMT
- Title: Exploring Representation Learning for Small-Footprint Keyword Spotting
- Authors: Fan Cui, Liyong Guo, Quandong Wang, Peng Gao, Yujun Wang
- Abstract summary: The main challenges of KWS are limited labeled data and limited device resources.
To address these challenges, we explore representation learning for KWS through self-supervised contrastive learning and self-training with a pretrained model.
Experiments on the Speech Commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy.
- Score: 11.586285744728068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate representation learning for low-resource
keyword spotting (KWS). The main challenges of KWS are limited labeled data and
limited available device resources. To address these challenges, we explore
representation learning for KWS through self-supervised contrastive learning and
self-training with a pretrained model. First, local-global contrastive siamese
networks (LGCSiam) are designed to learn similar utterance-level
representations for similar audio samples using the proposed local-global
contrastive loss, without requiring ground truth. Second, a self-supervised pretrained
Wav2Vec 2.0 model is applied as a constraint module (WVC) to force the KWS
model to learn frame-level acoustic representations. With the LGCSiam and WVC
modules, the proposed small-footprint KWS model can be pretrained with
unlabeled data. Experiments on the Speech Commands dataset show that the
self-training WVC module and the self-supervised LGCSiam module significantly
improve accuracy, especially in the case of training on a small labeled
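The abstract names two training signals: an utterance-level similarity objective between two views of the same audio, and a frame-level constraint from a pretrained model. The following is a minimal sketch of that general idea only, not the paper's LGC loss or its WVC module, whose exact definitions are not given here. It assumes a SimSiam-style negative cosine similarity for the siamese part and a mean-squared-error feature-matching term for the constraint; all function names are hypothetical, and embeddings are plain lists of floats standing in for network outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def global_loss(frames_a, frames_b):
    """Assumed utterance-level term: negative cosine similarity
    between the pooled embeddings of two views."""
    return -cosine(mean_pool(frames_a), mean_pool(frames_b))

def local_loss(frames_a, frames_b):
    """Assumed frame-level term: mean negative cosine similarity
    between time-aligned frame embeddings of two views."""
    sims = [cosine(fa, fb) for fa, fb in zip(frames_a, frames_b)]
    return -sum(sims) / len(sims)

def frame_constraint(student_frames, teacher_frames):
    """Assumed constraint term: mean squared error between the small
    model's frame features and a frozen teacher's frame features."""
    total, count = 0.0, 0
    for s, t in zip(student_frames, teacher_frames):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count

# Two "views" of one utterance: 2 frames, 2 feature dimensions each.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.0], [0.0, 1.0]]
pretrain_loss = global_loss(view_a, view_b) + local_loss(view_a, view_b)
```

Both similarity terms reach their minimum of -1 when the two views produce identical embeddings, and the constraint term is zero when the student features match the teacher's exactly. In practice the teacher features would come from a model such as Wav2Vec 2.0 and the terms would be weighted; those weights are not specified here.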
Related papers
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- Disentangled Training with Adversarial Examples For Robust Small-footprint Keyword Spotting [18.456711824241978]
We propose datasource-aware disentangled learning with adversarial examples to improve KWS robustness.
Experimental results demonstrate that the proposed learning strategy improves false reject rate by 40.31% at 1% false accept rate.
Our best-performing system achieves 98.06% accuracy on the Google Speech Commands V1 dataset.
arXiv Detail & Related papers (2024-08-23T20:03:51Z)
- Noise-Robust Keyword Spotting through Self-supervised Pretraining [11.90089857382705]
Self-supervised learning has been shown to increase the accuracy in clean conditions.
This paper explores how SSL pretraining can be used to enhance the robustness of KWS models in noisy conditions.
arXiv Detail & Related papers (2024-03-27T13:42:14Z)
- Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition [72.35438297011176]
We propose a novel method to realize seamless adaptation of pre-trained models for visual place recognition (VPR).
Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method.
Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time.
arXiv Detail & Related papers (2024-02-22T12:55:01Z)
- Improving Label-Deficient Keyword Spotting Through Self-Supervised Pretraining [18.19207291891767]
Keyword Spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants.
KWS models typically rely on a large amount of labelled data, limiting their applications only to situations where such data is available.
Self-supervised Learning (SSL) methods can mitigate such a reliance by leveraging readily-available unlabelled data.
arXiv Detail & Related papers (2022-10-04T15:56:27Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- Raw waveform speaker verification for supervised and self-supervised learning [30.08242210230669]
This paper proposes a new raw waveform speaker verification model that incorporates techniques proven effective for speaker verification.
Under the best performing configuration, the model shows an equal error rate of 0.89%, competitive with state-of-the-art models.
We also explore the proposed model with a self-supervised learning framework and show the state-of-the-art performance in this line of research.
arXiv Detail & Related papers (2022-03-16T09:28:03Z)
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
By exploiting the correspondence between geo-tagged audio recordings and remote sensing, this is done in a completely label-free manner.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in reality is commonly imbalanced and shows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning to automatically balance the representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach achieves performance close to that of a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.