Frequency Gating: Improved Convolutional Neural Networks for Speech
Enhancement in the Time-Frequency Domain
- URL: http://arxiv.org/abs/2011.04092v1
- Date: Sun, 8 Nov 2020 22:04:00 GMT
- Title: Frequency Gating: Improved Convolutional Neural Networks for Speech
Enhancement in the Time-Frequency Domain
- Authors: Koen Oostermeijer, Qing Wang and Jun Du
- Abstract summary: We introduce a method, which we call Frequency Gating, to compute multiplicative weights for the kernels of the CNN.
Experiments with an autoencoder neural network with skip connections show that both local and frequency-wise gating outperform the baseline.
A loss function based on the extended short-time objective intelligibility score (ESTOI) is introduced, which we show to outperform the standard mean squared error (MSE) loss function.
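The ESTOI-based loss can be illustrated with a simplified correlation-style stand-in: one minus the mean correlation between corresponding short spectro-temporal segments of the clean and estimated spectrograms. This is a minimal sketch, not the actual ESTOI computation (which involves one-third octave bands and envelope normalization); the function name and segment length are illustrative.

```python
import numpy as np

def correlation_loss(clean_spec, est_spec, seg_len=30):
    # Simplified intelligibility-style loss: 1 minus the mean
    # correlation between matching short segments of the clean and
    # estimated (freq, time) spectrograms. Lower is better; identical
    # inputs give a loss near 0, anti-correlated inputs near 2.
    losses = []
    T = clean_spec.shape[1]
    for t in range(0, T - seg_len + 1, seg_len):
        c = clean_spec[:, t:t + seg_len].ravel()
        e = est_spec[:, t:t + seg_len].ravel()
        # Zero-mean, unit-variance normalization per segment.
        c = (c - c.mean()) / (c.std() + 1e-8)
        e = (e - e.mean()) / (e.std() + 1e-8)
        losses.append(1.0 - np.mean(c * e))
    return float(np.mean(losses))
```

Unlike MSE, a correlation-based objective is invariant to per-segment scaling of the estimate, which is closer in spirit to how intelligibility metrics score speech.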
- Score: 37.722450363816144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the strengths of traditional convolutional neural networks (CNNs) is
their inherent translational invariance. However, for the task of speech
enhancement in the time-frequency domain, this property cannot be fully
exploited due to a lack of invariance in the frequency direction. In this paper
we propose to remedy this inefficiency by introducing a method, which we call
Frequency Gating, to compute multiplicative weights for the kernels of the CNN
in order to make them frequency dependent. Several mechanisms are explored:
temporal gating, in which weights depend on prior time frames; local
gating, whose weights are generated from a single time frame and those
adjacent to it; and frequency-wise gating, where each kernel is assigned a
weight independent of the input data. Experiments with an autoencoder neural
network with skip connections show that both local and frequency-wise gating
outperform the baseline and are therefore viable ways to improve CNN-based
speech enhancement neural networks. In addition, a loss function based on the
extended short-time objective intelligibility score (ESTOI) is introduced,
which we show to outperform the standard mean squared error (MSE) loss
function.
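The frequency-wise variant described above can be sketched as a learned, input-independent weight per output frequency bin that scales the shared kernel's response. This is a minimal numpy sketch under that reading of the abstract, not the paper's implementation; the kernel, gate values, and function names are illustrative.

```python
import numpy as np

def conv2d_valid(x, k):
    # Naive 2D valid cross-correlation of a spectrogram with a kernel.
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def frequency_wise_gating(spec, kernel, gate):
    # spec: (freq, time) magnitude spectrogram.
    # gate: one learned, input-independent weight per output frequency
    # bin; multiplying by it makes the otherwise translation-invariant
    # kernel frequency dependent.
    y = conv2d_valid(spec, kernel)
    return gate[:, None] * y  # broadcast the gate across the time axis
```

In the local and temporal variants, the gate would instead be computed from the input (adjacent or prior time frames) rather than stored as a free parameter.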
Related papers
- Fitting Auditory Filterbanks with Multiresolution Neural Networks [4.944919495794613]
We introduce a neural audio model named the multiresolution neural network (MuReNN).
The key idea behind MuReNN is to train separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT).
For a given real-world dataset, we fit the magnitude response of MuReNN to that of a well-established auditory filterbank.
arXiv Detail & Related papers (2023-07-25T21:20:12Z)
- Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [1.4277428617774877]
We present Vocos, a new model that directly generates Fourier spectral coefficients.
It substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches.
arXiv Detail & Related papers (2023-06-01T15:40:32Z)
- A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree Spectral Bias of Neural Networks [79.28094304325116]
Despite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards "simpler" functions.
We show how this spectral bias towards low-degree frequencies can in fact hurt the neural network's generalization on real-world datasets.
We propose a new scalable functional regularization scheme that aids the neural network to learn higher degree frequencies.
arXiv Detail & Related papers (2023-05-16T20:06:01Z)
- Properties and Potential Applications of Random Functional-Linked Types of Neural Networks [81.56822938033119]
Random functional-linked neural networks (RFLNNs) offer an alternative way of learning in deep structures.
This paper gives some insights into the properties of RFLNNs from the viewpoint of the frequency domain.
We propose a method to generate a BLS network with better performance, and design an efficient algorithm for solving Poisson's equation.
arXiv Detail & Related papers (2023-04-03T13:25:22Z) - NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z) - TFN: An Interpretable Neural Network with Time-Frequency Transform
Embedded for Intelligent Fault Diagnosis [6.812133175214715]
Convolutional Neural Networks (CNNs) are widely used in fault diagnosis of mechanical systems.
We propose a novel interpretable neural network, termed the Time-Frequency Network (TFN), where the physically meaningful time-frequency transform (TFT) method is embedded into the traditional convolutional layer as an adaptive preprocessing layer.
In this study, four typical TFT methods are considered to formulate TFNs, and their effectiveness and interpretability are demonstrated through three mechanical fault diagnosis experiments.
arXiv Detail & Related papers (2022-09-05T14:48:52Z) - Spike-inspired Rank Coding for Fast and Accurate Recurrent Neural
Networks [5.986408771459261]
Biological spiking neural networks (SNNs) can temporally encode information in their outputs, whereas artificial neural networks (ANNs) conventionally do not.
Here we show that temporal coding such as rank coding (RC) inspired by SNNs can also be applied to conventional ANNs such as LSTMs.
RC-training also significantly reduces time-to-insight during inference, with a minimal decrease in accuracy.
We demonstrate these in two toy problems of sequence classification, and in a temporally-encoded MNIST dataset where our RC model achieves 99.19% accuracy after the first input time-step.
arXiv Detail & Related papers (2021-10-06T15:51:38Z)
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves less than a 50 ms word-timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, surpassing the accuracy of other biologically plausible neuromorphic approaches to sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Robust Learning with Frequency Domain Regularization [1.370633147306388]
We introduce a new regularization method that constrains the frequency spectra of the model's filters.
We demonstrate the effectiveness of our regularization by (1) defending against adversarial perturbations; (2) reducing the generalization gap across different architectures; and (3) improving generalization in transfer learning scenarios without fine-tuning.
arXiv Detail & Related papers (2020-07-07T07:29:20Z)
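The idea of constraining a filter's frequency spectrum can be sketched as a penalty on the filter's high-frequency Fourier energy, added to the training loss. This is a hedged illustration of the general technique, not the regularizer from the paper above; the cutoff value and function name are assumptions.

```python
import numpy as np

def high_freq_penalty(filt, cutoff=0.25):
    # Spectral energy of a 2D conv filter above a normalized frequency
    # cutoff. Adding this term to the training loss discourages filters
    # with sharp, high-frequency responses.
    F = np.fft.fft2(filt)
    fy = np.fft.fftfreq(filt.shape[0])[:, None]
    fx = np.fft.fftfreq(filt.shape[1])[None, :]
    mask = np.sqrt(fy ** 2 + fx ** 2) > cutoff  # high-frequency bins
    return float(np.sum(np.abs(F[mask]) ** 2))
```

A constant (all-DC) filter incurs no penalty, while a rapidly oscillating one is penalized heavily, which pushes training towards smoother filters.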
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.