Audio classification with Dilated Convolution with Learnable Spacings
- URL: http://arxiv.org/abs/2309.13972v2
- Date: Wed, 22 Nov 2023 16:49:20 GMT
- Title: Audio classification with Dilated Convolution with Learnable Spacings
- Authors: Ismail Khalfaoui-Hassani, Timothée Masquelier and Thomas Pellegrini
- Abstract summary: Dilated convolution with learnable spacings (DCLS) is a recent convolution method in which the positions of the kernel elements are learned throughout training by backpropagation.
Here we show that DCLS is also useful for audio tagging using the AudioSet classification benchmark.
- Score: 10.89964981012741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dilated convolution with learnable spacings (DCLS) is a recent convolution
method in which the positions of the kernel elements are learned throughout
training by backpropagation. Its interest has recently been demonstrated in
computer vision (ImageNet classification and downstream tasks). Here we show
that DCLS is also useful for audio tagging using the AudioSet classification
benchmark. We took two state-of-the-art convolutional architectures using
depthwise separable convolutions (DSC), ConvNeXt and ConvFormer, and a hybrid
one using attention in addition, FastViT, and drop-in replaced all the DSC
layers by DCLS ones. This significantly improved the mean average precision
(mAP) with the three architectures without increasing the number of parameters
and with only a low cost on the throughput. The method code is based on PyTorch
and is available at https://github.com/K-H-Ismail/DCLS-Audio
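As an illustration of the idea in the abstract, here is a minimal 1D sketch of a kernel whose element positions are real-valued. It assumes linear interpolation is used to spread each weight over the two nearest integer taps, which is what would make the position usable in backpropagation. The function names `dcls_kernel_1d` and `conv1d_valid` are hypothetical and are not taken from the authors' repository; the actual method operates on PyTorch tensors.

```python
import math

def dcls_kernel_1d(weights, positions, size):
    """Build a dense 1D kernel of length `size` from a few weights placed
    at real-valued positions (hypothetical helper, linear interpolation assumed)."""
    kernel = [0.0] * size
    for w, p in zip(weights, positions):
        # Keep the position inside the dense kernel.
        p = min(max(p, 0.0), size - 1.0)
        left = int(math.floor(p))
        frac = p - left
        # Split the weight between the two neighbouring integer taps.
        kernel[left] += w * (1.0 - frac)
        if frac > 0.0:
            kernel[left + 1] += w * frac
    return kernel

def conv1d_valid(signal, kernel):
    """Plain 'valid' cross-correlation, as used by deep learning conv layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Two kernel elements spread over a receptive field of 7 taps.
kernel = dcls_kernel_1d(weights=[1.0, -2.0], positions=[0.5, 5.0], size=7)
output = conv1d_valid([1.0] * 8, kernel)
```

In this sketch the receptive field spans 7 taps while only two weights and two positions are stored. Between integer taps, each kernel entry varies linearly with the position, so the position has a well-defined gradient.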
Related papers
- 3D-Convolution Guided Spectral-Spatial Transformer for Hyperspectral Image Classification [12.729885732069926]
Vision Transformers (ViTs) have shown promising classification performance over Convolutional Neural Networks (CNNs).
ViTs excel with sequential data, but they cannot extract spectral-spatial information like CNNs.
We propose a 3D-Convolution guided Spectral-Spatial Transformer (3D-ConvSST) for HSI classification.
arXiv Detail & Related papers (2024-04-20T03:39:54Z) - PosCUDA: Position based Convolution for Unlearnable Audio Datasets [7.4768400786925175]
PosCUDA is a position based convolution for creating unlearnable audio datasets.
We empirically show that PosCUDA can achieve unlearnability while maintaining the quality of the original audio datasets.
arXiv Detail & Related papers (2024-01-04T08:39:49Z) - Dilated Convolution with Learnable Spacings: beyond bilinear
interpolation [10.89964981012741]
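Dilated Convolution with Learnable Spacings (DCLS) is a recently proposed variation of the dilated convolution in which the positions of the kernel elements are learnable.
Non-integer positions are handled via interpolation, which gives the positions well-defined gradients.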
The method code is based on PyTorch.
arXiv Detail & Related papers (2023-06-01T15:42:08Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows better recognition of speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in an offline and an online setup, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z) - Focal Sparse Convolutional Networks for 3D Object Detection [121.45950754511021]
We introduce two new modules to enhance the capability of Sparse CNNs.
They are focal sparse convolution (Focals Conv) and its multi-modal variant of focal sparse convolution with fusion.
For the first time, we show that spatially learnable sparsity in sparse convolution is essential for sophisticated 3D object detection.
arXiv Detail & Related papers (2022-04-26T17:34:10Z) - Dilated convolution with learnable spacings [6.6389732792316005]
CNNs need large receptive fields (RF) to compete with visual transformers.
RFs can simply be enlarged by increasing the convolution kernel sizes.
The number of trainable parameters, which scales quadratically with the kernel's size in the 2D case, rapidly becomes prohibitive.
This paper presents a new method to increase the RF size without increasing the number of parameters.
arXiv Detail & Related papers (2021-12-07T14:54:24Z) - RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for
Image Recognition [123.59890802196797]
We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition.
We construct convolutional layers inside a RepMLP during training and merge them into the FC for inference.
By inserting RepMLP in traditional CNNs, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs.
arXiv Detail & Related papers (2021-05-05T06:17:40Z) - Improving Calibration for Long-Tailed Recognition [68.32848696795519]
We propose two methods to improve calibration and performance in such scenarios.
For dataset bias due to different samplers, we propose shifted batch normalization.
Our proposed methods set new records on multiple popular long-tailed recognition benchmark datasets.
arXiv Detail & Related papers (2021-04-01T13:55:21Z) - End-To-End Dilated Variational Autoencoder with Bottleneck
Discriminative Loss for Sound Morphing -- A Preliminary Study [0.0]
We present a preliminary study on an end-to-end variational autoencoder (VAE) for sound morphing.
Two VAE variants are compared: a VAE with dilation layers (DC-VAE) and a VAE with only regular convolutional layers (CC-VAE).
arXiv Detail & Related papers (2020-11-19T09:47:13Z) - Device-Robust Acoustic Scene Classification Based on Two-Stage
Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.