Related papers: Keyword spotting -- Detecting commands in speech using deep learning

Keyword spotting -- Detecting commands in speech using deep learning

URL: http://arxiv.org/abs/2312.05640v1
Date: Sat, 9 Dec 2023 19:04:17 GMT
Title: Keyword spotting -- Detecting commands in speech using deep learning
Authors: Sumedha Rai, Tong Li, Bella Lyu
Abstract summary: We implement feature engineering by converting raw waveforms to Mel Frequency Cepstral Coefficients (MFCCs) In our experiments, RNN with BiLSTM and Attention achieves the best performance with an accuracy of 93.9 %.
Score: 2.709166684084394
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Speech recognition has become an important task in the development of machine learning and artificial intelligence. In this study, we explore the important task of keyword spotting using speech recognition machine learning and deep learning techniques. We implement feature engineering by converting raw waveforms to Mel Frequency Cepstral Coefficients (MFCCs), which we use as inputs to our models. We experiment with several different algorithms such as Hidden Markov Model with Gaussian Mixture, Convolutional Neural Networks and variants of Recurrent Neural Networks including Long Short-Term Memory and the Attention mechanism. In our experiments, RNN with BiLSTM and Attention achieves the best performance with an accuracy of 93.9 %

Related papers

SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition [17.568724398229232]
Speech emotion recognition (SER) plays an important role in emotional states from deciphering speech signals. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER. It exploits the capabilities of wavelets for effective localization in both time and frequency domains.
arXiv Detail & Related papers (2025-02-01T04:18:06Z)
Understanding Auditory Evoked Brain Signal via Physics-informed Embedding Network with Multi-Task Transformer [3.261870217889503]
We propose an innovative multi-task learning model, Physics-informed Embedding Network with Multi-Task Transformer (PEMT-Net) PEMT-Net enhances decoding performance through physics-informed embedding and deep learning techniques. Experiments on a specific dataset demonstrate PEMT-Net's significant performance in multi-task auditory signal decoding.
arXiv Detail & Related papers (2024-06-04T06:53:32Z)
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z)
Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing. Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy. We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition [91.39701446828144]
We show that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method. They have shown promising results on speech command recognition tasks. In contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
arXiv Detail & Related papers (2022-12-01T12:36:26Z)
Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. We find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models.
arXiv Detail & Related papers (2022-11-22T02:50:12Z)
Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity. We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network. The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a better representation of features in both input and hidden layers. This richer representation results in recognition accuracy improvement as shown by extensive experiments on Google speech commands (GSC) and synthetic speech commands (SSC) datasets.
arXiv Detail & Related papers (2020-11-23T14:40:18Z)
Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose the novel Fine-grained Early Attention (FEFA) for speech signals. This model is capable of focusing on information items as small as frequency bins. We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
arXiv Detail & Related papers (2020-09-03T17:40:27Z)
TinySpeech: Attention Condensers for Deep Speech Recognition Neural Networks on Edge Devices [71.68436132514542]
We introduce the concept of attention condensers for building low-footprint, highly-efficient deep neural networks for on-device speech recognition on the edge. To illustrate its efficacy, we introduce TinySpeech, low-precision deep neural networks tailored for on-device speech recognition.
arXiv Detail & Related papers (2020-08-10T16:34:52Z)
A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition [0.0]
We show a transfer learning method in speech emotion recognition based on a Time-Delay Neural Network architecture. We achieve the highest significantly higher accuracy when compared to state-of-the-art, using five-fold cross validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z)
Incremental Training of a Recurrent Neural Network Exploiting a Multi-Scale Dynamic Memory [79.42778415729475]
We propose a novel incrementally trained recurrent architecture targeting explicitly multi-scale learning. We show how to extend the architecture of a simple RNN by separating its hidden state into different modules. We discuss a training algorithm where new modules are iteratively added to the model to learn progressively longer dependencies.
arXiv Detail & Related papers (2020-06-29T08:35:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.