Resource-Efficient Speech Mask Estimation for Multi-Channel Speech
Enhancement
- URL: http://arxiv.org/abs/2007.11477v1
- Date: Wed, 22 Jul 2020 14:58:29 GMT
- Title: Resource-Efficient Speech Mask Estimation for Multi-Channel Speech
Enhancement
- Authors: Lukas Pfeifenberger, Matthias Zöhrer, Günther Schindler, Wolfgang Roth, Holger Fröning and Franz Pernkopf
- Abstract summary: We provide a resource-efficient approach for multi-channel speech enhancement based on Deep Neural Networks (DNNs).
In particular, we use reduced-precision DNNs for estimating a speech mask from noisy, multi-channel microphone observations.
In the extreme case of binary weights and reduced-precision activations, a significant reduction of execution time and memory footprint is possible.
- Score: 15.361841669377776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While machine learning techniques are traditionally resource-intensive, we
are currently witnessing an increased interest in hardware- and energy-efficient
approaches. This need for resource-efficient machine learning is primarily
driven by the demand for embedded systems and their usage in ubiquitous
computing and IoT applications. In this article, we provide a
resource-efficient approach for multi-channel speech enhancement based on Deep
Neural Networks (DNNs). In particular, we use reduced-precision DNNs for
estimating a speech mask from noisy, multi-channel microphone observations.
This speech mask is used to obtain either the Minimum Variance Distortionless
Response (MVDR) or Generalized Eigenvalue (GEV) beamformer. In the extreme case
of binary weights and reduced precision activations, a significant reduction of
execution time and memory footprint is possible while still obtaining audio
quality almost on par with single-precision DNNs and only a slightly higher Word
Error Rate (WER) for single-speaker scenarios using the WSJ0 speech corpus.
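As a rough illustration of the processing chain in the abstract (DNN-estimated speech mask, mask-weighted spatial PSD matrices, then an MVDR or GEV beamformer), here is a minimal per-frequency-bin sketch in NumPy/SciPy. The function name, the rank-1 steering-vector approximation, and the simplified PSD estimates are illustrative assumptions; the reduced-precision mask-estimation network itself is omitted.

```python
import numpy as np
from scipy.linalg import eigh

def mask_based_beamformer(Y, mask, kind="mvdr"):
    """Sketch of mask-based beamforming for one frequency bin.

    Y:    (M, T) complex STFT frames for M microphones at this bin.
    mask: (T,) speech presence probability in [0, 1], e.g. from a
          reduced-precision DNN (the mask estimator is omitted here).
    """
    # Mask-weighted spatial power spectral density (PSD) matrices.
    phi_ss = (mask * Y) @ Y.conj().T / np.maximum(mask.sum(), 1e-8)
    phi_nn = ((1.0 - mask) * Y) @ Y.conj().T / np.maximum((1.0 - mask).sum(), 1e-8)

    if kind == "gev":
        # GEV (max-SNR) beamformer: principal generalized eigenvector.
        _, vecs = eigh(phi_ss, phi_nn)
        w = vecs[:, -1]
    else:
        # MVDR, with the steering vector approximated by the principal
        # eigenvector of the speech PSD matrix (rank-1 assumption).
        _, vecs = eigh(phi_ss)
        d = vecs[:, -1]
        num = np.linalg.solve(phi_nn, d)
        w = num / (d.conj() @ num)
    return w.conj() @ Y  # (T,) enhanced output at this bin
```

In practice this runs independently for every frequency bin of the STFT, and the GEV variant is often paired with a post-filter such as blind analytic normalization to reduce speech distortion.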
Related papers
- Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps [4.002057316863807]
We investigate binary activation maps (BAMs) for speech quality prediction on a convolutional architecture based on DNSMOS.
We show that the binary activation model with quantization aware training matches the predictive performance of the baseline model.
Our approach results in a 25-fold memory reduction during inference, while replacing almost all dot products with summations.
arXiv Detail & Related papers (2024-07-05T15:15:00Z)
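A common way to realize binary activation maps under quantization-aware training is a sign activation with a straight-through estimator (STE). The PyTorch sketch below is a generic illustration of that trick, not the paper's exact model:

```python
import torch

class BinaryActivation(torch.autograd.Function):
    """Sign activation with a clipped straight-through estimator,
    a standard trick for quantization-aware training."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only where |x| <= 1 (hard-tanh surrogate).
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

x = torch.randn(4, requires_grad=True)
y = BinaryActivation.apply(x)   # values in {-1, +1}
y.sum().backward()              # gradients flow through the STE
```

The forward pass sees hard {-1, +1} activations while gradients flow through the clipped surrogate; at inference, binary activations let dot products be replaced with summations, which is the source of the memory savings cited above.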
- sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks [51.516451451719654]
Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient.
This paper introduces a novel SNN-based Voice Activity Detection model, referred to as sVAD.
It provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms.
arXiv Detail & Related papers (2024-03-09T02:55:44Z)
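The SincNet front end mentioned above constrains each first-layer filter to a learnable band-pass defined only by its cutoff frequencies. A simplified PyTorch layer in that spirit (initialization values and sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv1d(nn.Module):
    """Simplified SincNet-style layer: each filter is a learnable
    band-pass parametrized by its low cutoff and bandwidth."""
    def __init__(self, out_channels=32, kernel_size=129, sample_rate=16000):
        super().__init__()
        # Cutoffs stored in normalized frequency (cycles per sample).
        low = torch.linspace(30.0, sample_rate / 2 - 300.0, out_channels)
        self.low = nn.Parameter(low / sample_rate)
        self.band = nn.Parameter(torch.full((out_channels,), 100.0 / sample_rate))
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):            # x: (batch, 1, samples)
        f1 = self.low.abs()
        f2 = f1 + self.band.abs()
        def lowpass(f):              # ideal low-pass with normalized cutoff f
            return 2 * f[:, None] * torch.sinc(2 * f[:, None] * self.n)
        filters = (lowpass(f2) - lowpass(f1)) * self.window  # band-pass
        return F.conv1d(x, filters.unsqueeze(1), padding="same")
```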
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The CSE network is further integrated with SOT (serialized output training) to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Keyword spotting -- Detecting commands in speech using deep learning [2.709166684084394]
We implement feature engineering by converting raw waveforms to Mel Frequency Cepstral Coefficients (MFCCs).
In our experiments, an RNN with BiLSTM and attention achieves the best performance with an accuracy of 93.9%.
arXiv Detail & Related papers (2023-12-09T19:04:17Z)
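For the recipe above (MFCCs into a BiLSTM with attention), a compact classifier sketch could look as follows. Layer sizes and the 12-class output are assumptions, not the paper's configuration; the MFCC features themselves could come from, e.g., torchaudio.transforms.MFCC:

```python
import torch
import torch.nn as nn

class BiLSTMAttentionKWS(nn.Module):
    """Toy keyword-spotting classifier: MFCC frames -> BiLSTM ->
    attention-weighted pooling -> class logits."""
    def __init__(self, n_mfcc=40, hidden=64, n_classes=12):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, frames, n_mfcc)
        h, _ = self.lstm(x)               # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)
        context = (w.unsqueeze(-1) * h).sum(dim=1)
        return self.out(context)          # (batch, n_classes)
```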
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with attention mechanism, we can effectively boost performance without huge computational overhead.
We evaluate our approach on various image and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
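The learnable-memory-token idea can be sketched generically: a small bank of trainable vectors is prepended to the keys and values of an attention layer, so every query can also attend to the memory. This is a hedged simplification, not the paper's specific heterogeneous memory design:

```python
import torch
import torch.nn as nn

class MemoryTokenAttention(nn.Module):
    """Attention layer whose keys/values are augmented with a small
    bank of learnable memory tokens."""
    def __init__(self, dim=128, n_mem=16, n_heads=4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_mem, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):                            # x: (batch, tokens, dim)
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        kv = torch.cat([mem, x], dim=1)              # prepend memory tokens
        out, _ = self.attn(x, kv, kv)                # queries attend to memory too
        return out
```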
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
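As a loose reading of frequency-channel attention, one can gate a spectro-temporal feature map along both its channel and frequency axes, squeeze-and-excitation style. The sketch below is an assumption-laden simplification, not the actual MFA module:

```python
import torch
import torch.nn as nn

class FreqChannelGate(nn.Module):
    """Toy squeeze-and-excitation-style gates over the channel and
    frequency axes of a (batch, channels, freq, time) feature map."""
    def __init__(self, channels=64, freq_bins=80, reduction=4):
        super().__init__()
        self.ch_gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.fr_gate = nn.Sequential(
            nn.Linear(freq_bins, freq_bins // reduction), nn.ReLU(),
            nn.Linear(freq_bins // reduction, freq_bins), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, F, T)
        c = self.ch_gate(x.mean(dim=(2, 3)))    # per-channel weights
        f = self.fr_gate(x.mean(dim=(1, 3)))    # per-frequency weights
        return x * c[:, :, None, None] * f[:, None, :, None]
```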
- Event Based Time-Vectors for auditory features extraction: a neuromorphic approach for low power audio recognition [4.206844212918807]
We present a neuromorphic architecture capable of unsupervised auditory feature recognition.
We then validate the network on a subset of Google's Speech Commands dataset.
arXiv Detail & Related papers (2021-12-13T21:08:04Z)
- Broadcasted Residual Learning for Efficient Keyword Spotting [7.335747584353902]
We present a broadcasted residual learning method to achieve high accuracy with small model size and computational load.
We also propose a novel network architecture, Broadcasting-residual network (BC-ResNet), based on broadcasted residual learning.
BC-ResNets achieve state-of-the-art 98.0% and 98.7% top-1 accuracy on Google speech command datasets v1 and v2, respectively.
arXiv Detail & Related papers (2021-06-08T06:55:39Z)
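The core of broadcasted residual learning is to average 2D (frequency x time) features over frequency, process them with cheap 1D temporal convolutions, and broadcast the result back over the frequency axis as a residual. A much-simplified block (layer choices are assumptions):

```python
import torch
import torch.nn as nn

class BroadcastResidualBlock(nn.Module):
    """Simplified broadcasted residual block in the spirit of BC-ResNet."""
    def __init__(self, channels=32):
        super().__init__()
        # Depthwise 2D conv along the frequency axis.
        self.freq_conv = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                   padding=(1, 0), groups=channels)
        self.bn = nn.BatchNorm2d(channels)
        # Cheap 1D temporal conv on frequency-averaged features.
        self.time_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                  # x: (B, C, F, T)
        y = self.bn(self.freq_conv(x))
        z = y.mean(dim=2)                  # average over frequency -> (B, C, T)
        z = torch.relu(self.time_conv(z))
        return x + z.unsqueeze(2)          # broadcast 1D features over frequency
```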
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
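BPE-dropout regularizes subword segmentation by randomly skipping merge operations during encoding, so the same word is split differently across training epochs. A toy illustration with a made-up merge table:

```python
import random

def bpe_dropout_encode(word, merges, p=0.1):
    """Toy BPE-dropout: greedily apply ranked merge rules, but drop
    each candidate merge with probability p."""
    rank = {m: i for i, m in enumerate(merges)}
    symbols = list(word)
    while True:
        candidates = [(rank[pair], i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))
                      if pair in rank and random.random() >= p]
        if not candidates:
            return symbols
        _, i = min(candidates)                    # best-ranked surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_dropout_encode("lower", merges, p=0.3))  # e.g. ['low', 'er'] or ['lo', 'w', 'e', 'r']
```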
- TinySpeech: Attention Condensers for Deep Speech Recognition Neural Networks on Edge Devices [71.68436132514542]
We introduce the concept of attention condensers for building low-footprint, highly-efficient deep neural networks for on-device speech recognition on the edge.
To illustrate its efficacy, we introduce TinySpeech, low-precision deep neural networks tailored for on-device speech recognition.
arXiv Detail & Related papers (2020-08-10T16:34:52Z)
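An attention condenser, loosely, learns a condensed self-attention map that is expanded back to selectively scale the input, avoiding full pairwise attention. The sketch below is a strongly simplified reading of the idea; all layer choices are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCondenser(nn.Module):
    """Loose sketch of an attention condenser: condense the input,
    embed a compact attention map, expand it, and scale selectively."""
    def __init__(self, channels=32):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Conv1d(channels, channels // 2, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels // 2, channels, 3, padding=1), nn.Sigmoid())
        self.scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):                       # x: (B, C, T)
        a = F.max_pool1d(x, kernel_size=4)      # condense along time
        a = self.embed(a)                       # compact attention map in [0, 1]
        a = F.interpolate(a, size=x.shape[-1])  # expand back to input length
        return x * a * self.scale
```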
- Self-attention encoding and pooling for speaker recognition [16.96341561111918]
We propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given variable-length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We have evaluated this approach on both VoxCeleb1 & 2 datasets.
arXiv Detail & Related papers (2020-08-03T09:31:27Z)
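The encoding-plus-pooling step that turns variable-length frame features into one fixed-size speaker embedding can be illustrated with additive self-attentive pooling; this is a minimal sketch, not the exact SAEP architecture:

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Collapse (batch, frames, dim) frame features into one
    fixed-size embedding per utterance via learned attention weights."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                                # h: (B, T, D)
        w = torch.softmax(self.score(h).squeeze(-1), dim=1)
        return (w.unsqueeze(-1) * h).sum(dim=1)          # (B, D)
```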