sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection
with Spiking Neural Networks
- URL: http://arxiv.org/abs/2403.05772v1
- Date: Sat, 9 Mar 2024 02:55:44 GMT
- Authors: Qu Yang, Qianhui Liu, Nan Li, Meng Ge, Zeyang Song, Haizhou Li
- Abstract summary: Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient.
This paper introduces a novel SNN-based Voice Activity Detection model, referred to as sVAD.
It provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms.
- Score: 51.516451451719654
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Speech applications are expected to be low-power and robust under noisy
conditions. An effective Voice Activity Detection (VAD) front-end lowers the
computational need. Spiking Neural Networks (SNNs) are known to be biologically
plausible and power-efficient. However, SNN-based VADs have yet to achieve
noise robustness and often require large models for high performance. This
paper introduces a novel SNN-based VAD model, referred to as sVAD, which
features an auditory encoder with an SNN-based attention mechanism.
Particularly, it provides effective auditory feature representation through
SincNet and 1D convolution, and improves noise robustness with attention
mechanisms. The classifier utilizes Spiking Recurrent Neural Networks (sRNN) to
exploit temporal speech information. Experimental results demonstrate that our
sVAD achieves remarkable noise robustness while maintaining low power
consumption and a small footprint, making it a promising solution for
real-world VAD applications.
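The paper does not publish code here, but the spiking principle behind an SNN-based VAD can be illustrated with a deliberately minimal sketch. This is not the sVAD model (which uses a SincNet + 1-D convolution auditory encoder, SNN-based attention, and an sRNN classifier); it is a toy in which a single leaky integrate-and-fire (LIF) neuron, driven by per-frame signal energy, emits a spike for speech-like frames. All names and parameter values (`frame_energy`, `lif_vad`, the threshold and decay constants) are illustrative assumptions.

```python
import math

def frame_energy(samples, frame_len=160):
    """Split a sample stream into frames and return each frame's mean-square energy."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [sum(x * x for x in f) / len(f) for f in frames]

def lif_vad(energies, threshold=0.1, decay=0.7):
    """Leaky integrate-and-fire detector: v <- decay*v + input; spike and reset at threshold."""
    v, spikes = 0.0, []
    for e in energies:
        v = decay * v + e          # leaky integration of the input
        if v >= threshold:
            spikes.append(1)       # spike = speech-like frame
            v = 0.0                # membrane potential resets after firing
        else:
            spikes.append(0)       # no spike = silence / noise
    return spikes

# Synthetic input: silence, a louder tone burst standing in for speech, silence.
sig = ([0.0] * 320
       + [0.5 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(640)]
       + [0.0] * 320)
decisions = lif_vad(frame_energy(sig))
print(decisions)  # → [0, 0, 1, 1, 1, 1, 0, 0]
```

The spike train is the detector's output: energy-based features replace the paper's learned SincNet filterbank, and a single neuron replaces its attention and sRNN stages, but the event-driven fire-and-reset dynamics are what make SNN hardware implementations power-efficient.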
Related papers
- DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement [3.409728296852651]
Speech enhancement improves communication in noisy environments, affecting areas such as automatic speech recognition, hearing aids, and telecommunications.
Neuromorphic algorithms in the form of spiking neural networks (SNNs) have great potential.
We develop a two-phase time-domain streaming SNN framework -- the Dual-Path Spiking Neural Network (DPSNN).
arXiv Detail & Related papers (2024-08-14T09:08:43Z)
- A Real-Time Voice Activity Detection Based On Lightweight Neural [4.589472292598182]
Voice activity detection (VAD) is the task of detecting speech in an audio stream.
Recent neural network-based VADs have alleviated the degradation of performance to some extent.
We propose a lightweight and real-time neural network called MagicNet, which utilizes causal and depthwise-separable 1-D convolutions and a GRU.
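The two building blocks named in that summary can be sketched in a few lines. This is a hypothetical illustration, not MagicNet's actual code: a causal 1-D convolution sees only inputs at or before time t (via left padding), and a depthwise-separable convolution splits a full convolution into a cheap per-channel depthwise pass followed by a 1x1 pointwise channel mix. All function names and weights are made up for the example.

```python
def causal_depthwise(x, kernels):
    """Depthwise pass: x is a list of channels (lists of samples), one kernel per channel."""
    out = []
    for ch, k in zip(x, kernels):
        pad = [0.0] * (len(k) - 1)        # left-pad only => output at t depends on inputs <= t
        padded = pad + ch
        out.append([sum(k[j] * padded[t + j] for j in range(len(k)))
                    for t in range(len(ch))])
    return out

def pointwise(x, weights):
    """1x1 conv: mix channels at each time step; weights[o][i] maps in-channel i to out-channel o."""
    T = len(x[0])
    return [[sum(w[i] * x[i][t] for i in range(len(x))) for t in range(T)]
            for w in weights]

# Two input channels, kernel size 3, one output channel.
x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
dw = causal_depthwise(x, kernels=[[0.0, 0.0, 1.0],   # identity kernel
                                  [0.5, 0.0, 0.0]])  # scaled two-step delay
y = pointwise(dw, weights=[[1.0, 1.0]])
```

The separable factorization needs `channels * k + out * in` weights instead of `out * in * k`, which is the parameter saving that makes such models lightweight enough for real-time streaming.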
arXiv Detail & Related papers (2024-05-27T03:31:16Z)
- Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks [53.31894108974566]
Spiking-LEAF is a learnable auditory front-end meticulously designed for SNN-based speech processing.
On keyword spotting and speaker identification tasks, the proposed Spiking-LEAF outperforms SOTA spiking auditory front-ends.
arXiv Detail & Related papers (2023-09-18T04:03:05Z)
- Single Channel Speech Enhancement Using U-Net Spiking Neural Networks [2.436681150766912]
Speech enhancement (SE) is crucial for reliable communication devices or robust speech recognition systems.
We propose a novel approach to SE using a spiking neural network (SNN) based on a U-Net architecture.
SNNs are suitable for processing data with a temporal dimension, such as speech, and are known for their energy-efficient implementation on neuromorphic hardware.
arXiv Detail & Related papers (2023-07-26T19:10:29Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate the design of a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF)
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Event Based Time-Vectors for auditory features extraction: a neuromorphic approach for low power audio recognition [4.206844212918807]
We present a neuromorphic architecture, capable of unsupervised auditory feature recognition.
We then validate the network on a subset of Google's Speech Commands dataset.
arXiv Detail & Related papers (2021-12-13T21:08:04Z)
- HASA-net: A non-intrusive hearing-aid speech assessment network [52.83357278948373]
We propose a DNN-based hearing aid speech assessment network (HASA-Net) to predict speech quality and intelligibility scores simultaneously.
To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids.
Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated to two well-known intrusive hearing-aid evaluation metrics.
arXiv Detail & Related papers (2021-11-10T14:10:13Z)
- Robust Learning of Recurrent Neural Networks in Presence of Exogenous Noise [22.690064709532873]
We propose a tractable robustness analysis for RNN models subject to input noise.
The robustness measure can be estimated efficiently using linearization techniques.
Our proposed methodology significantly improves robustness of recurrent neural networks.
arXiv Detail & Related papers (2021-05-03T16:45:05Z)
- Deep Networks for Direction-of-Arrival Estimation in Low SNR [89.45026632977456]
We introduce a Convolutional Neural Network (CNN) that is trained on multi-channel data of the true array manifold matrix.
We train a CNN in the low-SNR regime to predict DoAs across all SNRs.
Our robust solution can be applied in several fields, ranging from wireless array sensors to acoustic microphones or sonars.
arXiv Detail & Related papers (2020-11-17T12:52:18Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance under controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.