Speech Augmentation Based Unsupervised Learning for Keyword Spotting
- URL: http://arxiv.org/abs/2205.14329v1
- Date: Sat, 28 May 2022 04:11:31 GMT
- Title: Speech Augmentation Based Unsupervised Learning for Keyword Spotting
- Authors: Jian Luo, Jianzong Wang, Ning Cheng, Haobin Tang, Jing Xiao
- Abstract summary: We designed a CNN-Attention architecture to conduct the KWS task.
We also proposed an unsupervised learning method to improve the robustness of the KWS model.
In our experiments, with augmentation based unsupervised learning, our KWS model achieves better performance than other unsupervised methods.
- Score: 29.87252331166527
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigated a speech augmentation based
unsupervised learning approach for the keyword spotting (KWS) task. KWS is a
useful speech application, yet it also depends heavily on labeled data. We
designed a CNN-Attention architecture to conduct the KWS task: CNN layers focus
on local acoustic features, while attention layers model long-time
dependencies. To improve the robustness of the KWS model, we also proposed an
unsupervised learning method. The unsupervised loss is based on the similarity
between the original and augmented speech features, as well as the audio
reconstruction information. Two speech augmentation methods are explored in the
unsupervised learning: speed and intensity. Experiments on the Google Speech
Commands V2 dataset demonstrated that our CNN-Attention model achieves
competitive results. Moreover, augmentation based unsupervised learning further
improves the classification accuracy of the KWS task: our KWS model achieves
better performance than other unsupervised methods, such as CPC, APC, and MPC.
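The abstract's recipe (speed and intensity augmentation, plus a loss combining original-vs-augmented feature similarity with a reconstruction term) can be sketched as follows. This is a minimal NumPy illustration with hypothetical helper names and a cosine-similarity formulation chosen for concreteness; the paper's actual model and loss weighting are not specified here.

```python
import numpy as np

def augment_speed(features, rate=1.1):
    """Speed augmentation sketch: resample feature frames along the time
    axis by linear interpolation (a stand-in for waveform speed change)."""
    t_old = np.arange(features.shape[0])
    t_new = np.linspace(0, features.shape[0] - 1, int(features.shape[0] / rate))
    return np.stack([np.interp(t_new, t_old, features[:, d])
                     for d in range(features.shape[1])], axis=1)

def augment_intensity(features, gain_db=3.0):
    """Intensity augmentation sketch: scale feature energies by a gain."""
    return features * (10.0 ** (gain_db / 20.0))

def unsupervised_loss(orig, aug, recon, target, alpha=0.5):
    """Combine (1) a similarity term between the original and augmented
    features and (2) a reconstruction error, as the abstract describes.
    Time axes are aligned by truncating to the shorter sequence."""
    n = min(orig.shape[0], aug.shape[0])
    o, a = orig[:n].ravel(), aug[:n].ravel()
    cos_sim = np.dot(o, a) / (np.linalg.norm(o) * np.linalg.norm(a) + 1e-8)
    recon_err = np.mean((recon - target) ** 2)
    return alpha * (1.0 - cos_sim) + (1.0 - alpha) * recon_err

# Toy example: 100 frames of 40-dimensional features.
feats = np.random.default_rng(0).normal(size=(100, 40))
loss = unsupervised_loss(feats, augment_intensity(feats), feats, feats)
```

With a pure intensity change (a positive rescaling) and a perfect reconstruction, both loss terms are near zero, which matches the intuition that the learned features should be invariant to these augmentations.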
Related papers
- Exploring Representation Learning for Small-Footprint Keyword Spotting [11.586285744728068]
The main challenges of KWS are limited labeled data and limited available device resources.
To address these challenges, we explore representation learning for KWS via self-supervised contrastive learning and self-training with a pretrained model.
Experiments on the Speech Commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy.
arXiv Detail & Related papers (2023-03-20T07:09:26Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - On Higher Adversarial Susceptibility of Contrastive Self-Supervised Learning [104.00264962878956]
Contrastive self-supervised learning (CSL) has managed to match or surpass the performance of supervised learning in image and video classification.
It is still largely unknown if the nature of the representation induced by the two learning paradigms is similar.
We identify the uniform distribution of data representation over a unit hypersphere in the CSL representation space as the key contributor to this phenomenon.
We devise strategies that are simple, yet effective in improving model robustness with CSL training.
arXiv Detail & Related papers (2022-07-22T03:49:50Z) - Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (a KWS branch and an SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to boost the performance of KWS and SV simultaneously.
arXiv Detail & Related papers (2022-03-31T03:18:13Z) - Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT.
arXiv Detail & Related papers (2021-12-16T10:45:05Z) - Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experimental results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained only on the US English data.
arXiv Detail & Related papers (2021-12-10T20:47:58Z) - Characterizing the adversarial vulnerability of speech self-supervised learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries.
The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z) - SSAST: Self-Supervised Audio Spectrogram Transformer [19.09439093130855]
We propose to pretrain the Audio Spectrogram Transformer (AST) model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio.
We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification.
To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
arXiv Detail & Related papers (2021-10-19T07:58:28Z) - Multi-task Learning with Cross Attention for Keyword Spotting [8.103605110339519]
Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase.
There is a mismatch between the training criterion (phoneme recognition) and the target task (KWS).
Recently, multi-task learning has been applied to KWS to exploit both ASR and KWS training data.
arXiv Detail & Related papers (2021-07-15T22:38:16Z) - Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose a novel Fine-grained Early Attention (FEFA) mechanism for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
arXiv Detail & Related papers (2020-09-03T17:40:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.