Fitting Auditory Filterbanks with Multiresolution Neural Networks
- URL: http://arxiv.org/abs/2307.13821v1
- Date: Tue, 25 Jul 2023 21:20:12 GMT
- Title: Fitting Auditory Filterbanks with Multiresolution Neural Networks
- Authors: Vincent Lostanlen, Daniel Haider, Han Han, Mathieu Lagrange, Peter
Balazs, Martin Ehler
- Abstract summary: We introduce a neural audio model, named multiresolution neural network (MuReNN)
The key idea behind MuReNN is to train separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT)
For a given real-world dataset, we fit the magnitude response of MuReNN to that of a well-established auditory filterbank.
- Score: 4.944919495794613
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Waveform-based deep learning faces a dilemma between nonparametric and
parametric approaches. On one hand, convolutional neural networks (convnets)
may approximate any linear time-invariant system; yet, in practice, their
frequency responses become more irregular as their receptive fields grow. On
the other hand, a parametric model such as LEAF is guaranteed to yield Gabor
filters, hence an optimal time-frequency localization; yet, this strong
inductive bias comes at the detriment of representational capacity. In this
paper, we aim to overcome this dilemma by introducing a neural audio model,
named multiresolution neural network (MuReNN). The key idea behind MuReNN is to
train separate convolutional operators over the octave subbands of a discrete
wavelet transform (DWT). Since the scale of DWT atoms grows exponentially
between octaves, the receptive fields of the subsequent learnable convolutions
in MuReNN are dilated accordingly. For a given real-world dataset, we fit the
magnitude response of MuReNN to that of a well-established auditory filterbank:
Gammatone for speech, CQT for music, and third-octave for urban sounds,
respectively. This is a form of knowledge distillation (KD), in which the
filterbank "teacher" is engineered by domain knowledge while the neural
network "student" is optimized from data. We compare MuReNN to the state of
the art in terms of goodness of fit after KD on a hold-out set and in terms of
Heisenberg time-frequency localization. Compared to convnets and Gabor
convolutions, we find that MuReNN reaches state-of-the-art performance on all
three optimization problems.
Related papers
- Accurate Mapping of RNNs on Neuromorphic Hardware with Adaptive Spiking Neurons [2.9410174624086025]
We present a $\Sigma\Delta$-low-pass RNN (lpRNN) for mapping rate-based RNNs to spiking neural networks (SNNs).
An adaptive spiking neuron model encodes signals using $\Sigma\Delta$-modulation and enables precise mapping.
We demonstrate the implementation of the lpRNN on Intel's neuromorphic research chip Loihi.
arXiv Detail & Related papers (2024-07-18T14:06:07Z) - Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models [42.39774323584976]
We propose a deep learning based system for the task of deepfake audio detection.
In particular, the raw input audio is first transformed into various spectrograms.
We leverage the state-of-the-art audio pre-trained models of Whisper, Seamless, Speechbrain, and Pyannote to extract audio embeddings.
arXiv Detail & Related papers (2024-07-01T20:10:43Z) - Instabilities in Convnets for Raw Audio [1.5060156580765574]
We present a theory of large deviations for the energy response of FIR filterbanks with random Gaussian weights.
We find that deviations worsen for large filters and locally periodic input signals.
Numerical simulations align with our theory and suggest that the condition number of a convolutional layer follows a logarithmic scaling law (a small numerical sketch of this quantity appears after this list).
arXiv Detail & Related papers (2023-09-11T22:34:06Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - Spiking Neural Network Decision Feedback Equalization [70.3497683558609]
We propose an SNN-based equalizer with a feedback structure akin to the decision feedback equalizer (DFE)
We show that our approach clearly outperforms conventional linear equalizers for three different exemplary channels.
The proposed SNN with a decision feedback structure enables the path to competitive energy-efficient transceivers.
arXiv Detail & Related papers (2022-11-09T09:19:15Z) - Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z) - Deep Time Delay Neural Network for Speech Enhancement with Full Data
Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z) - Frequency Gating: Improved Convolutional Neural Networks for Speech
Enhancement in the Time-Frequency Domain [37.722450363816144]
We introduce a method, which we call Frequency Gating, to compute multiplicative weights for the kernels of the CNN.
Experiments with an autoencoder neural network with skip connections show that both local and frequency-wise gating outperform the baseline.
A loss function based on the extended short-time objective intelligibility score (ESTOI) is introduced, which we show to outperform the standard mean squared error (MSE) loss function.
arXiv Detail & Related papers (2020-11-08T22:04:00Z) - Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by
Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean error azimuth of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z) - Depthwise Separable Convolutions Versus Recurrent Neural Networks for
Monaural Singing Voice Separation [17.358040670413505]
We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs)
We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance.
Our results show that replacing RNNs with DWS-CNNs yields improvements of 1.20, 0.06, and 0.37 dB, respectively, while using only 20.57% of the parameters of the RNN architecture.
arXiv Detail & Related papers (2020-07-06T12:32:34Z) - Approximation and Non-parametric Estimation of ResNet-type Convolutional
Neural Networks [52.972605601174955]
We show a ResNet-type CNN can attain the minimax optimal error rates in important function classes.
We derive approximation and estimation error rates of the aforementioned type of CNNs for the Barron and Hölder classes.
arXiv Detail & Related papers (2019-03-24T19:42:39Z)