Speech Augmentation Based Unsupervised Learning for Keyword Spotting
- URL: http://arxiv.org/abs/2205.14329v1
- Date: Sat, 28 May 2022 04:11:31 GMT
- Title: Speech Augmentation Based Unsupervised Learning for Keyword Spotting
- Authors: Jian Luo, Jianzong Wang, Ning Cheng, Haobin Tang, Jing Xiao
- Abstract summary: We designed a CNN-Attention architecture to conduct the KWS task.
We also proposed an unsupervised learning method to improve the robustness of the KWS model.
In our experiments, with augmentation based unsupervised learning, our KWS model achieves better performance than other unsupervised methods.
- Score: 29.87252331166527
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigated a speech augmentation based unsupervised
learning approach for the keyword spotting (KWS) task. KWS is a useful speech
application, yet it heavily depends on labeled data. We designed a
CNN-Attention architecture to conduct the KWS task. The CNN layers focus on
local acoustic features, and the attention layers model long-time dependencies.
To improve the robustness of the KWS model, we also proposed an unsupervised
learning method. The unsupervised loss is based on the similarity between the
original and augmented speech features, as well as the audio reconstruction
information. Two speech augmentation methods are explored in the unsupervised
learning: speed and intensity. Experiments on the Google Speech Commands V2
dataset demonstrated that our CNN-Attention model achieves competitive results.
Moreover, the augmentation based unsupervised learning could further improve
the classification accuracy of the KWS task. In our experiments, with
augmentation based unsupervised learning, our KWS model achieves better
performance than other unsupervised methods, such as CPC, APC, and MPC.
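To make the loss concrete, here is a minimal PyTorch sketch of an augmentation-based unsupervised objective in the spirit of the abstract; the encoder, decoder, loss weights, and augmentation ranges are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def augment_intensity(feats, gain_db_range=6.0):
    """Randomly scale feature energy to simulate an intensity change (illustrative)."""
    gain_db = (torch.rand(1, device=feats.device) * 2 - 1) * gain_db_range
    return feats * (10.0 ** (gain_db / 20.0))

def augment_speed(feats, factor_range=(0.9, 1.1)):
    """Stretch or compress the time axis of (batch, time, dim) features, then
    resample back to the original length so tensors stay comparable (illustrative)."""
    factor = torch.empty(1).uniform_(*factor_range).item()
    new_len = max(1, int(feats.size(1) * factor))
    x = F.interpolate(feats.transpose(1, 2), size=new_len, mode="linear", align_corners=False)
    x = F.interpolate(x, size=feats.size(1), mode="linear", align_corners=False)
    return x.transpose(1, 2)

def unsupervised_loss(encoder, decoder, feats, alpha=1.0, beta=1.0):
    """Similarity between original/augmented embeddings plus a reconstruction term."""
    aug = augment_speed(augment_intensity(feats))
    z_orig, z_aug = encoder(feats), encoder(aug)      # (batch, time, hidden)
    sim_loss = 1.0 - F.cosine_similarity(z_orig, z_aug, dim=-1).mean()
    recon_loss = F.l1_loss(decoder(z_aug), feats)     # reconstruct input features
    return alpha * sim_loss + beta * recon_loss
```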
Related papers
- Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology [4.080686348274667]
We introduce a novel approach combining unsupervised contrastive learning with a unique augmentation-based technique.
Our method allows the neural network to train on unlabeled datasets, potentially improving performance in downstream tasks.
We present a speech augmentation-based unsupervised learning method that utilizes the similarity between the bottleneck layer feature and the audio reconstructing information.
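The summary does not spell out the contrastive objective; a standard InfoNCE loss over matched augmented pairs, shown here purely as an assumed stand-in, is one common instantiation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: matched augmented pairs are positives, the rest of the batch negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)   # unit-norm embeddings
    logits = z1 @ z2.t() / temperature                          # (batch, batch) similarity
    targets = torch.arange(z1.size(0), device=z1.device)        # positives on the diagonal
    return F.cross_entropy(logits, targets)
```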
arXiv Detail & Related papers (2024-08-31T05:40:37Z)
- Disentangled Training with Adversarial Examples For Robust Small-footprint Keyword Spotting [18.456711824241978]
We propose datasource-aware disentangled learning with adversarial examples to improve KWS robustness.
Experimental results demonstrate that the proposed learning strategy improves the false reject rate by 40.31% at a 1% false accept rate.
Our best-performing system achieves $98.06%$ accuracy on the Google Speech Commands V1 dataset.
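How the adversarial examples are generated is not stated in the summary; the fast gradient sign method (FGSM) is a common choice and is sketched below as an assumption, not the paper's exact recipe.

```python
import torch

def fgsm_example(model, feats, labels, loss_fn, epsilon=0.001):
    """Generate an FGSM adversarial example by perturbing the input features
    along the sign of the loss gradient (generic sketch)."""
    feats = feats.clone().detach().requires_grad_(True)
    loss = loss_fn(model(feats), labels)
    loss.backward()
    return (feats + epsilon * feats.grad.sign()).detach()
```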
arXiv Detail & Related papers (2024-08-23T20:03:51Z)
- Exploring Representation Learning for Small-Footprint Keyword Spotting [11.586285744728068]
The main challenges of KWS are limited labeled data and limited available device resources.
To address those challenges, we explore representation learning for KWS by self-supervised contrastive learning and self-training with a pretrained model.
Experiments on the Speech Commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy.
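Setting aside the paper-specific WVC and LGCSiam modules, the self-training half of such a pipeline usually follows a pseudo-labeling loop; a generic sketch under that assumption:

```python
import torch

@torch.no_grad()
def pseudo_label(teacher, unlabeled_feats, threshold=0.95):
    """Keep only confident teacher predictions as pseudo-labels (generic self-training step)."""
    probs = torch.softmax(teacher(unlabeled_feats), dim=-1)
    conf, labels = probs.max(dim=-1)
    mask = conf >= threshold                 # drop low-confidence examples
    return unlabeled_feats[mask], labels[mask]
```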
arXiv Detail & Related papers (2023-03-20T07:09:26Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- On Higher Adversarial Susceptibility of Contrastive Self-Supervised Learning [104.00264962878956]
Contrastive self-supervised learning (CSL) has managed to match or surpass the performance of supervised learning in image and video classification.
It is still largely unknown whether the nature of the representations induced by the two learning paradigms is similar.
We identify the uniform distribution of data representations over a unit hypersphere in the CSL representation space as the key contributor to this increased adversarial susceptibility (see the sketch below).
We devise strategies that are simple, yet effective in improving model robustness with CSL training.
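One standard way to quantify the "uniform distribution over a unit hypersphere" is the uniformity metric of Wang and Isola (2020); the sketch below uses it for illustration and is not taken from this paper.

```python
import torch
import torch.nn.functional as F

def uniformity(z, t=2.0):
    """Wang & Isola (2020) uniformity: log mean pairwise Gaussian potential of
    L2-normalized embeddings; lower means more uniform on the hypersphere."""
    z = F.normalize(z, dim=-1)
    sq_dists = torch.pdist(z, p=2).pow(2)    # pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()
```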
arXiv Detail & Related papers (2022-07-22T03:49:50Z)
- Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (a KWS branch and an SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously.
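The summary does not specify the orthogonality regularizer; one plausible form, shown as an assumption rather than the paper's exact loss, penalizes the cross-correlation between the two branches' embeddings.

```python
import torch

def orthogonality_penalty(z_kws, z_sv):
    """Push the KWS and SV branch embeddings toward orthogonal subspaces by
    penalizing their cross-correlation (illustrative)."""
    z_kws = z_kws - z_kws.mean(dim=0)            # center each branch
    z_sv = z_sv - z_sv.mean(dim=0)
    cross = z_kws.t() @ z_sv / z_kws.size(0)     # (d_kws, d_sv) cross-correlation
    return cross.pow(2).sum()                    # squared Frobenius norm
```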
arXiv Detail & Related papers (2022-03-31T03:18:13Z)
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
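The core mechanism, applying the SSL objective at intermediate layers in addition to the top layer, can be sketched as a weighted sum; the layer indices and weight below are assumptions.

```python
def ils_ssl_loss(hidden_states, targets, ssl_loss_fn, layers=(4, 8), weight=1.0):
    """Sum the SSL objective over selected intermediate layers in addition to
    the final layer, as in intermediate-layer supervision (illustrative sketch)."""
    loss = ssl_loss_fn(hidden_states[-1], targets)       # standard top-layer loss
    for layer in layers:
        loss = loss + weight * ssl_loss_fn(hidden_states[layer], targets)
    return loss
```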
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experimental results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
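The summary leaves open how the multiple hypotheses enter training; one plausible reading, sketched here with a hypothetical model.seq2seq_loss helper, weights each self-labeled hypothesis by its decoder score.

```python
import torch

def multi_hypothesis_loss(model, feats, hypotheses, scores):
    """Weight the seq2seq loss of each self-labeled hypothesis by its normalized
    decoder score (illustrative MTL-style combination; seq2seq_loss is assumed)."""
    weights = torch.softmax(torch.tensor(scores), dim=0)
    losses = torch.stack([model.seq2seq_loss(feats, hyp) for hyp in hypotheses])
    return (weights.to(losses.device) * losses).sum()
```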
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Characterizing the adversarial vulnerability of speech self-supervised learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of such a paradigm under attacks from both zero-knowledge and limited-knowledge adversaries.
The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z)
- SSAST: Self-Supervised Audio Spectrogram Transformer [19.09439093130855]
We propose to pretrain the Audio Spectrogram Transformer (AST) model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio.
We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification.
To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
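A minimal sketch of joint discriminative and generative masked spectrogram patch modeling follows; the masking ratio, heads, and loss forms are assumptions, and SSAST's actual configuration differs in detail.

```python
import torch
import torch.nn.functional as F

def mspm_loss(encoder, gen_head, disc_head, patches, mask_ratio=0.4):
    """Mask spectrogram patches, then (1) reconstruct them (generative) and
    (2) match each prediction to its true patch among the masked set (discriminative).
    disc_head is assumed to project back to the patch dimension."""
    mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
    hidden = encoder(patches.masked_fill(mask.unsqueeze(-1), 0.0))  # (batch, n_patches, d)
    gen_loss = F.mse_loss(gen_head(hidden)[mask], patches[mask])    # generative: MSE
    pred = F.normalize(disc_head(hidden)[mask], dim=-1)             # (n_masked, patch_dim)
    true = F.normalize(patches[mask], dim=-1)
    logits = pred @ true.t()                                        # all-pairs similarity
    targets = torch.arange(logits.size(0), device=logits.device)    # match on the diagonal
    disc_loss = F.cross_entropy(logits, targets)
    return gen_loss + disc_loss
```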
arXiv Detail & Related papers (2021-10-19T07:58:28Z)