Related papers: Deep Spoken Keyword Spotting: An Overview

URL: http://arxiv.org/abs/2111.10592v1
Date: Sat, 20 Nov 2021 13:46:57 GMT
Title: Deep Spoken Keyword Spotting: An Overview
Authors: Iván López-Espejo and Zheng-Hua Tan and John Hansen and Jesper Jensen
Abstract summary: Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams. Deep KWS has become a hot research topic among speech scientists.
Score: 28.33535370965807
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.

Related papers

SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents [52.29009595100625]
Role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. We construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations.
arXiv Detail & Related papers (2025-08-04T03:18:36Z)
Multitaper mel-spectrograms for keyword spotting [42.82842124247846]
This paper investigates the use of the multitaper technique to create improved features for KWS. Experimental results confirm the advantages of using the proposed improved features.
arXiv Detail & Related papers (2024-07-05T17:18:25Z)
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z)
Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features [2.724035499453558]
We present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling supra-segmental temporal (SST) information. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced.
arXiv Detail & Related papers (2023-11-01T12:45:31Z)
Speech Augmentation Based Unsupervised Learning for Keyword Spotting [29.87252331166527]
We designed a CNN-Attention architecture to conduct the KWS task. We also proposed an unsupervised learning method to improve the robustness of the KWS model. In our experiments, with augmentation-based unsupervised learning, our KWS model achieves better performance than other unsupervised methods.
arXiv Detail & Related papers (2022-05-28T04:11:31Z)
Deep Learning for Visual Speech Analysis: A Survey [54.53032361204449]
This paper presents a review of recent progress in deep learning methods on visual speech analysis. We cover different aspects of visual speech, including fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance.
arXiv Detail & Related papers (2022-05-22T14:44:53Z)
Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications. We develop a two-branch deep network (KWS branch and SV branch) with the same network structure. A novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously.
arXiv Detail & Related papers (2022-03-31T03:18:13Z)
Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data-intensive automatic speech recognition technologies based on deep neural networks (DNNs). This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native automated speech scoring (ASS), called speaker-conditioned hierarchical modeling. In our technique, we take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed them as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
Threat of Adversarial Attacks on Deep Learning in Computer Vision: Survey II [86.51135909513047]
Deep Learning is vulnerable to adversarial attacks that can manipulate its predictions. This article reviews the contributions made by the computer vision community in adversarial attacks on deep learning. It provides definitions of technical terminologies for non-experts in this domain.
arXiv Detail & Related papers (2021-08-01T08:54:47Z)
SoK: The Faults in our ASRs: An Overview of Attacks against Automatic Speech Recognition and Speaker Identification Systems [28.635467696564703]
We show that the end-to-end architecture of speech and speaker systems makes attacks and defenses against them substantially different than those in the image space. We then demonstrate experimentally that attacks against these models almost universally fail to transfer.
arXiv Detail & Related papers (2020-07-13T18:52:25Z)
Exploring Filterbank Learning for Keyword Spotting [27.319236923928205]
This paper explores filterbank learning for keyword spotting (KWS). Two approaches are examined: filterbank matrix learning in the power spectral domain and parameter learning of a psychoacoustically-motivated gammachirp filterbank. Our experimental results reveal that, in general, there are no statistically significant differences, in terms of KWS accuracy, between using a learned filterbank and handcrafted speech features.
arXiv Detail & Related papers (2020-05-30T08:11:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.