ESResNet: Environmental Sound Classification Based on Visual Domain
Models
- URL: http://arxiv.org/abs/2004.07301v1
- Date: Wed, 15 Apr 2020 19:07:55 GMT
- Title: ESResNet: Environmental Sound Classification Based on Visual Domain
Models
- Authors: Andrey Guzhov, Federico Raue, Jörn Hees and Andreas Dengel
- Abstract summary: We present a model that is inherently compatible with mono and stereo sound inputs.
We investigate the influence of cross-domain pre-training and architectural changes, and evaluate our model on standard datasets.
- Score: 4.266320191208303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Environmental Sound Classification (ESC) is an active research area in the
audio domain and has seen a lot of progress in the past years. However, many of
the existing approaches achieve high accuracy by relying on domain-specific
features and architectures, making it harder to benefit from advances in other
fields (e.g., the image domain). Additionally, some of the past successes have
been attributed to a discrepancy of how results are evaluated (i.e., on
unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall
progression of the field.
The contribution of this paper is twofold. First, we present a model that is
inherently compatible with mono and stereo sound inputs. Our model is based on
simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines
them with several well-known approaches from the image domain (i.e., ResNet,
Siamese-like networks and attention). We investigate the influence of
cross-domain pre-training and architectural changes, and evaluate our model on
standard datasets. We find that our model outperforms all previously known
approaches in a fair comparison, achieving accuracies of 97.0% (ESC-10),
91.5% (ESC-50) and 84.2% / 85.4% (US8K mono / stereo).
Second, we provide a comprehensive overview of the current state of the field
by distinguishing, for several previously reported results on the US8K dataset,
whether they were obtained on official or unofficial splits. For better
reproducibility, our code (including any re-implementations) is made available.
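To make the described pipeline concrete, here is a minimal sketch, assuming PyTorch and torchvision, of a log-power STFT front-end feeding an ImageNet-pretrained ResNet-50. All hyperparameters are illustrative guesses rather than the paper's values, and the paper's Siamese-like stereo handling and attention blocks are omitted; the stereo channels are simply averaged here.

    import torch
    import torchvision

    def log_power_stft(wave, n_fft=1024, hop=256):
        # wave: (channels, samples); 1 channel for mono, 2 for stereo
        window = torch.hann_window(n_fft)
        spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                          window=window, return_complex=True)
        return torch.log(spec.abs() ** 2 + 1e-10)  # (channels, freq, time)

    class SpectrogramResNet(torch.nn.Module):
        def __init__(self, num_classes=50):
            super().__init__()
            # ImageNet-pretrained backbone, reused via cross-domain pre-training
            self.backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
            self.backbone.fc = torch.nn.Linear(
                self.backbone.fc.in_features, num_classes)

        def forward(self, wave):
            spec = log_power_stft(wave)            # (channels, freq, time)
            x = spec.mean(dim=0, keepdim=True)     # naively average channels
            x = x.unsqueeze(0).repeat(1, 3, 1, 1)  # fake RGB: (1, 3, F, T)
            return self.backbone(x)

    # 5 s of stereo audio at 44.1 kHz -> logits over 50 classes (e.g. ESC-50)
    logits = SpectrogramResNet(num_classes=50)(torch.randn(2, 220500))

Repeating the single spectrogram channel across the RGB axis is what lets an unmodified image-domain backbone, and its cross-domain pre-training, be reused for audio.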
Related papers
- CFDP: Common Frequency Domain Pruning [0.3021678014343889]
We introduce a novel end-to-end pipeline for model pruning via the frequency domain.
We achieve state-of-the-art results on CIFAR-10 with GoogLeNet, reaching an accuracy of 95.25%, an improvement of 0.2% over the original model.
In addition to strong performance, models produced via CFDP exhibit robustness across a variety of configurations.
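The summary above does not spell out CFDP's actual pruning criterion, so the following is only a loose, hypothetical illustration of frequency-domain channel scoring, not the paper's method: channels whose average feature maps carry little low-frequency DCT energy are marked for removal.

    import numpy as np
    from scipy.fft import dctn

    def channel_scores(feature_maps, keep=8):
        # feature_maps: (C, H, W) average activation map per channel
        scores = []
        for fm in feature_maps:
            coeffs = dctn(fm, norm="ortho")      # 2-D DCT of the map
            scores.append(np.abs(coeffs[:keep, :keep]).sum())  # low-freq energy
        return np.array(scores)

    def keep_mask(feature_maps, prune_ratio=0.3):
        scores = channel_scores(feature_maps)
        k = int(len(scores) * prune_ratio)
        return scores >= np.sort(scores)[k]      # True = keep this channel

    mask = keep_mask(np.random.rand(64, 28, 28))  # toy 64-channel layer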
arXiv Detail & Related papers (2023-06-07T04:49:26Z)
- Fairness meets Cross-Domain Learning: a new perspective on Models and Metrics [80.07271410743806]
We study the relationship between cross-domain learning (CD) and model fairness.
We introduce a benchmark on face and medical images spanning several demographic groups as well as classification and localization tasks.
Our study covers 14 CD approaches alongside three state-of-the-art fairness algorithms and shows how the former can outperform the latter.
arXiv Detail & Related papers (2023-03-25T09:34:05Z)
- FIXED: Frustratingly Easy Domain Generalization with Mixup [53.782029033068675]
Domain generalization (DG) aims to learn a generalizable model from multiple training domains such that it can perform well on unseen target domains.
A popular strategy is to augment training data to benefit generalization through methods such as Mixup (Zhang et al., 2018).
We propose a simple yet effective enhancement for Mixup-based DG, namely domain-invariant Feature mIXup (FIX).
Our approach significantly outperforms nine state-of-the-art related methods, beating the best-performing baseline by 6.5% on average in terms of test accuracy.
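For context, vanilla Mixup, the augmentation FIXED builds on, fits in a few lines; the paper's domain-invariant, feature-level variant is not reproduced here.

    import numpy as np

    def mixup(x1, y1, x2, y2, alpha=0.2):
        # Convexly combine two training examples and their one-hot labels
        lam = np.random.beta(alpha, alpha)
        return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2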
arXiv Detail & Related papers (2022-11-07T09:38:34Z)
- AudioCLIP: Extending CLIP to Image, Text and Audio [6.585049648605185]
We present an extension of the CLIP model that handles audio in addition to text and images.
Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset.
AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task.
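The CLIP framework mentioned here pairs modalities with a symmetric contrastive objective; the sketch below shows that standard two-modality loss, which AudioCLIP applies pairwise across image, text and audio. The encoders producing the embeddings are assumed, not shown.

    import torch
    import torch.nn.functional as F

    def clip_loss(emb_a, emb_b, temperature=0.07):
        # Symmetric InfoNCE over a batch of paired embeddings (N, D)
        a = F.normalize(emb_a, dim=-1)
        b = F.normalize(emb_b, dim=-1)
        logits = a @ b.t() / temperature       # (N, N) cosine similarities
        targets = torch.arange(a.size(0))      # matching pairs on the diagonal
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2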
arXiv Detail & Related papers (2021-06-24T14:16:38Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
- Perceiver: General Perception with Iterative Attention [85.65927856589613]
We introduce the Perceiver, a model that builds upon Transformers.
We show that this architecture performs competitively or beyond strong, specialized models on classification tasks.
It also surpasses state-of-the-art results for all modalities in AudioSet.
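A rough sketch of the Perceiver's central mechanism, as commonly described: a small learned latent array repeatedly cross-attends to the much larger input array, keeping attention cost linear in input size. Layer norms, MLP blocks and the exact depth are omitted; all dimensions are illustrative.

    import torch

    class LatentCrossAttention(torch.nn.Module):
        def __init__(self, dim=256, num_latents=64, heads=8, depth=4):
            super().__init__()
            self.latents = torch.nn.Parameter(torch.randn(num_latents, dim))
            self.blocks = torch.nn.ModuleList(
                [torch.nn.MultiheadAttention(dim, heads, batch_first=True)
                 for _ in range(depth)])

        def forward(self, inputs):                 # inputs: (B, M, dim), M large
            z = self.latents.expand(inputs.size(0), -1, -1)  # (B, N, dim)
            for attn in self.blocks:
                # queries come from the latents, keys/values from the inputs
                z = z + attn(z, inputs, inputs)[0]
            return z                               # compressed representation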
arXiv Detail & Related papers (2021-03-04T18:20:50Z)
- SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification [0.6767885381740952]
SoundCLR is a supervised contrastive learning method for effective environmental sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking-based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance.
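The masking augmentation reported here is in the spirit of SpecAugment-style time/frequency masking; the sketch below is an assumed minimal version, and the widths and masking policy are illustrative, not the paper's settings.

    import numpy as np

    def mask_spectrogram(spec, max_freq_width=16, max_time_width=32):
        # Zero one random frequency band and one random time span
        spec = spec.copy()
        n_mels, n_frames = spec.shape
        f = np.random.randint(0, max_freq_width + 1)   # band height
        f0 = np.random.randint(0, n_mels - f + 1)      # band start
        spec[f0:f0 + f, :] = 0.0
        t = np.random.randint(0, max_time_width + 1)   # span width
        t0 = np.random.randint(0, n_frames - t + 1)    # span start
        spec[:, t0:t0 + t] = 0.0
        return spec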
arXiv Detail & Related papers (2021-03-02T18:42:45Z)
- Urban Sound Classification : striving towards a fair comparison [0.0]
We present our DCASE 2020 task 5 winning solution which aims at helping the monitoring of urban noise pollution.
It achieves a macro-AUPRC of 0.82 / 0.62 for the coarse / fine classification on the validation set.
It reaches accuracies of 89.7% and 85.41% on the ESC-50 and US8K datasets, respectively.
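For readers unfamiliar with the metric: macro-AUPRC is the unweighted mean of per-class average precision. A minimal sketch using scikit-learn follows; the exact DCASE evaluation script may differ.

    import numpy as np
    from sklearn.metrics import average_precision_score

    def macro_auprc(y_true, y_score):
        # y_true: (N, C) binary indicators; y_score: (N, C) predicted scores
        per_class = [average_precision_score(y_true[:, c], y_score[:, c])
                     for c in range(y_true.shape[1])]
        return float(np.mean(per_class))  # unweighted mean over classes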
arXiv Detail & Related papers (2020-10-22T15:37:39Z)
- $n$-Reference Transfer Learning for Saliency Prediction [73.17061116358036]
We propose a few-shot transfer learning paradigm for saliency prediction.
The proposed framework is gradient-based and model-agnostic.
The results show that the proposed framework achieves a significant performance improvement.
arXiv Detail & Related papers (2020-07-09T23:20:44Z)
- Learning Meta Face Recognition in Unseen Domains [74.69681594452125]
We propose a novel face recognition method via meta-learning, named Meta Face Recognition (MFR).
MFR synthesizes the source/target domain shift with a meta-optimization objective.
We propose two benchmarks for generalized face recognition evaluation.
arXiv Detail & Related papers (2020-03-17T14:10:30Z)
- Acoustic Scene Classification Using Bilinear Pooling on Time-liked and Frequency-liked Convolution Neural Network [4.131608702779222]
This paper explores the use of harmonic and percussive source separation (HPSS) to split the audio into harmonic and percussive components, on which two CNNs are trained.
The deep features extracted from these two CNNs are then combined using bilinear pooling.
The model is evaluated on the DCASE 2019 subtask 1a dataset, scoring an average of 65% on the development set.
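HPSS itself is a standard operation; a minimal sketch using librosa's implementation is below. The filename is hypothetical, and the two CNN branches and bilinear pooling are only indicated in comments.

    import librosa

    y, sr = librosa.load("scene.wav", sr=None)   # hypothetical input file
    y_harmonic, y_percussive = librosa.effects.hpss(y)
    # Each component would feed its own CNN; the two deep feature vectors
    # are then combined with bilinear (outer-product) pooling.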
arXiv Detail & Related papers (2020-02-14T04:06:32Z)