Sound Event Detection with Depthwise Separable and Dilated Convolutions
- URL: http://arxiv.org/abs/2002.00476v1
- Date: Sun, 2 Feb 2020 19:50:51 GMT
- Title: Sound Event Detection with Depthwise Separable and Dilated Convolutions
- Authors: Konstantinos Drossos and Stylianos I. Mimilakis and Shayan Gharib and
Yanxiong Li and Tuomas Virtanen
- Abstract summary: State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal.
We propose the replacement of the CNNs with depthwise separable convolutions and the replacement of the RNNs with dilated convolutions.
We achieve a reduction of the amount of parameters by 85% and average training time per epoch by 78%, and an increase the average frame-wise F1 score and reduction of the average error rate by 4.6% and 3.8%, respectively.
- Score: 23.104644393058123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art sound event detection (SED) methods usually employ a series
of convolutional neural networks (CNNs) to extract useful features from the
input audio signal, and then recurrent neural networks (RNNs) to model longer
temporal context in the extracted features. The number of the channels of the
CNNs and size of the weight matrices of the RNNs have a direct effect on the
total amount of parameters of the SED method, which is to a couple of millions.
Additionally, the usually long sequences that are used as an input to an SED
method along with the employment of an RNN, introduce implications like
increased training time, difficulty at gradient flow, and impeding the
parallelization of the SED method. To tackle all these problems, we propose the
replacement of the CNNs with depthwise separable convolutions and the
replacement of the RNNs with dilated convolutions. We compare the proposed
method to a baseline convolutional neural network on a SED task, and achieve a
reduction of the amount of parameters by 85% and average training time per
epoch by 78%, and an increase the average frame-wise F1 score and reduction of
the average error rate by 4.6% and 3.8%, respectively.
Related papers
- Patch-CNN: Training data-efficient deep learning for high-fidelity
diffusion tensor estimation from minimal diffusion protocols [3.0416974614291226]
We propose a new method, Patch-CNN, for diffusion tensor (DT) estimation from only six-direction diffusion weighted images (DWI)
Compared with image-wise FCNs, the minimal kernel vastly reduces training data demand.
The improved fibre orientation estimation is shown to produce improved tractogram.
arXiv Detail & Related papers (2023-07-03T20:39:48Z) - Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z) - Training High-Performance Low-Latency Spiking Neural Networks by
Differentiation on Spike Representation [70.75043144299168]
Spiking Neural Network (SNN) is a promising energy-efficient AI model when implemented on neuromorphic hardware.
It is a challenge to efficiently train SNNs due to their non-differentiability.
We propose the Differentiation on Spike Representation (DSR) method, which could achieve high performance.
arXiv Detail & Related papers (2022-05-01T12:44:49Z) - Selfish Sparse RNN Training [13.165729746380816]
We propose an approach to train sparse RNNs with a fixed parameter count in one single run, without compromising performance.
We achieve state-of-the-art sparse training results with various datasets on Penn TreeBank and Wikitext-2.
arXiv Detail & Related papers (2021-01-22T10:45:40Z) - Deep Networks for Direction-of-Arrival Estimation in Low SNR [89.45026632977456]
We introduce a Convolutional Neural Network (CNN) that is trained from mutli-channel data of the true array manifold matrix.
We train a CNN in the low-SNR regime to predict DoAs across all SNRs.
Our robust solution can be applied in several fields, ranging from wireless array sensors to acoustic microphones or sonars.
arXiv Detail & Related papers (2020-11-17T12:52:18Z) - Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper- parameters of state-of-the-art factored time delay neural networks (TDNNs)
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z) - Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by
Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean error azimuth of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z) - Depthwise Separable Convolutions Versus Recurrent Neural Networks for
Monaural Singing Voice Separation [17.358040670413505]
We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs)
We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance.
Our results show that by replacing RNNs with DWS-CNNs yields an improvement of 1.20, 0.06, 0.37 dB, respectively, while using only 20.57% of the amount of parameters of the RNN architecture.
arXiv Detail & Related papers (2020-07-06T12:32:34Z) - Progressive Tandem Learning for Pattern Recognition with Deep Spiking
Neural Networks [80.15411508088522]
Spiking neural networks (SNNs) have shown advantages over traditional artificial neural networks (ANNs) for low latency and high computational efficiency.
We propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition.
arXiv Detail & Related papers (2020-07-02T15:38:44Z) - Exploring Pre-training with Alignments for RNN Transducer based
End-to-End Speech Recognition [39.497407288772386]
recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training, and whole-network pre-training respectively.
arXiv Detail & Related papers (2020-05-01T19:00:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.