Deep Spoken Keyword Spotting: An Overview
- URL: http://arxiv.org/abs/2111.10592v1
- Date: Sat, 20 Nov 2021 13:46:57 GMT
- Title: Deep Spoken Keyword Spotting: An Overview
- Authors: Iv\'an L\'opez-Espejo and Zheng-Hua Tan and John Hansen and Jesper
Jensen
- Abstract summary: Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams.
Deep KWS has become a hot research topic among speech scientists.
- Score: 28.33535370965807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes like the activation of voice assistants. Prospects suggest a sustained
growth in terms of social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review into deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview has a comprehensive
nature by covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS.
Related papers
- Multitaper mel-spectrograms for keyword spotting [42.82842124247846]
This paper investigates the use of the multitaper technique to create improved features for KWS.
Experiment results confirm the advantages of using the proposed improved features.
arXiv Detail & Related papers (2024-07-05T17:18:25Z) - What to Remember: Self-Adaptive Continual Learning for Audio Deepfake
Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z) - Deep Neural Networks for Automatic Speaker Recognition Do Not Learn
Supra-Segmental Temporal Features [2.724035499453558]
We present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST.
We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced.
arXiv Detail & Related papers (2023-11-01T12:45:31Z) - Speech Augmentation Based Unsupervised Learning for Keyword Spotting [29.87252331166527]
We designed a CNN-Attention architecture to conduct the KWS task.
We also proposed an unsupervised learning method to improve the robustness of KWS model.
In our experiments, with augmentation based unsupervised learning, our KWS model achieves better performance than other unsupervised methods.
arXiv Detail & Related papers (2022-05-28T04:11:31Z) - Deep Learning for Visual Speech Analysis: A Survey [54.53032361204449]
This paper presents a review of recent progress in deep learning methods on visual speech analysis.
We cover different aspects of visual speech, including fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance.
arXiv Detail & Related papers (2022-05-22T14:44:53Z) - Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keywords spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (KWS branch and SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to push up the performance of KWS and SV simultaneously.
arXiv Detail & Related papers (2022-03-31T03:18:13Z) - Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z) - Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. In our technique, we take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed them as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z) - Threat of Adversarial Attacks on Deep Learning in Computer Vision:
Survey II [86.51135909513047]
Deep Learning is vulnerable to adversarial attacks that can manipulate its predictions.
This article reviews the contributions made by the computer vision community in adversarial attacks on deep learning.
It provides definitions of technical terminologies for non-experts in this domain.
arXiv Detail & Related papers (2021-08-01T08:54:47Z) - SoK: The Faults in our ASRs: An Overview of Attacks against Automatic
Speech Recognition and Speaker Identification Systems [28.635467696564703]
We show that the end-to-end architecture of speech and speaker systems makes attacks and defenses against them substantially different than those in the image space.
We then demonstrate experimentally that attacks against these models almost universally fail to transfer.
arXiv Detail & Related papers (2020-07-13T18:52:25Z) - Exploring Filterbank Learning for Keyword Spotting [27.319236923928205]
This paper explores filterbank learning for keyword spotting (KWS)
Two approaches are examined: filterbank matrix learning in the power spectral domain and parameter learning of a psychoacoustically-motivated gammachirp filterbank.
Our experimental results reveal that, in general, there are no statistically significant differences, in terms of KWS accuracy, between using a learned filterbank and handcrafted speech features.
arXiv Detail & Related papers (2020-05-30T08:11:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.