Urban Sound Classification: striving towards a fair comparison
- URL: http://arxiv.org/abs/2010.11805v1
- Date: Thu, 22 Oct 2020 15:37:39 GMT
- Title: Urban Sound Classification: striving towards a fair comparison
- Authors: Augustin Arnault, Baptiste Hanssens and Nicolas Riche
- Abstract summary: We present our DCASE 2020 task 5 winning solution, which aims to help monitor urban noise pollution.
It achieves a macro-AUPRC of 0.82 / 0.62 for the coarse / fine classification on the validation set.
It reaches accuracies of 89.7% and 85.41% on the ESC-50 and US8k datasets, respectively.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Urban sound classification has achieved remarkable progress and is
still an active research area in audio pattern recognition. In particular, it
allows monitoring of noise pollution, which is a growing concern for large
cities. The contribution of this paper is two-fold. First, we present our
DCASE 2020 task 5 winning solution, which aims to help monitor urban noise
pollution. It achieves a macro-AUPRC of 0.82 / 0.62 for the coarse / fine
classification on the validation set. Moreover, it reaches accuracies of
89.7% and 85.41% on the ESC-50 and US8k datasets, respectively. Second, it is
not easy to find a fair comparison of existing models or to reproduce their
performance: authors sometimes copy-paste results from the original papers,
which does not help reproducibility. We therefore provide a fair comparison by
using the same input representation, metrics and optimizer to assess
performance, while preserving the data augmentation used by the original
papers. We hope this framework can help evaluate new architectures in this
field. For better reproducibility, the code is available on our GitHub
repository.
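Since macro-AUPRC is the headline metric above, a minimal sketch of how it can be computed may help; the implementation below is generic step-wise average precision averaged over classes, not the authors' evaluation code, and the toy labels and scores are illustrative only:

```python
import numpy as np

def average_precision(y_true, y_score):
    """Step-wise average precision (area under the PR curve) for one class."""
    order = np.argsort(-np.asarray(y_score))      # rank by descending score
    y_true = np.asarray(y_true)[order]
    tp = np.cumsum(y_true)                        # true positives per cutoff
    precision = tp / np.arange(1, len(y_true) + 1)
    # average the precision attained at each positive example
    return precision[y_true == 1].sum() / y_true.sum()

def macro_auprc(y_true, y_score):
    """Mean of per-class average precision (one column per class)."""
    return float(np.mean([average_precision(y_true[:, c], y_score[:, c])
                          for c in range(y_true.shape[1])]))

# toy two-class example with a perfect per-class ranking
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
y_score = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
print(macro_auprc(y_true, y_score))  # → 1.0
```

For binary-indicator labels, scikit-learn's `average_precision_score(..., average="macro")` computes the same quantity.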
Related papers
- Less is More: Fewer Interpretable Region via Submodular Subset Selection [54.07758302264416]
This paper re-models the above image attribution problem as a submodular subset selection problem.
We construct a novel submodular function to discover more accurate small interpretation regions.
For correctly predicted samples, the proposed method improves the Deletion and Insertion scores by an average of 4.9% and 2.5%, respectively, relative to HSIC-Attribution.
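The optimization at the heart of such a submodular subset selection can be illustrated with the standard greedy algorithm; the coverage function and region names below are toy stand-ins, not the paper's attribution-specific objective:

```python
def greedy_submodular(candidates, f, k):
    """Greedily maximize a monotone submodular set function f by always
    adding the element with the largest marginal gain; this carries the
    classical (1 - 1/e) approximation guarantee."""
    selected = []
    for _ in range(k):
        base = f(selected)
        best, best_gain = None, float("-inf")
        for c in candidates:
            if c in selected:
                continue
            gain = f(selected + [c]) - base   # marginal gain of adding c
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
    return selected

# toy coverage function: a set of regions is worth the number of
# distinct elements (stand-ins for pixels) it covers
regions = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1, 4}}
cover = lambda s: len(set().union(*(regions[x] for x in s)))
print(greedy_submodular(list(regions), cover, 2))  # → ['a', 'b']
```

Note the toy run also shows the approximation at work: greedy covers 3 elements while the optimum ("b" plus "d") covers 4.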
arXiv Detail & Related papers (2024-02-14T13:30:02Z)
- SparseVSR: Lightweight and Noise Robust Visual Speech Recognition [100.43280310123784]
We generate a lightweight model that achieves higher performance than its dense model equivalent.
Our results confirm that sparse networks are more resistant to noise than dense networks.
arXiv Detail & Related papers (2023-07-10T13:34:13Z)
- Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation [80.07065346699005]
It is widely assumed that the optimal noise distribution should be made equal to the data distribution, as in Generative Adversarial Networks (GANs).
We turn to Noise-Contrastive Estimation which grounds this self-supervised task as an estimation problem of an energy-based model of the data.
We soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.
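The NCE objective referred to here reduces to logistic regression between data and noise samples, with the logit given by the model's unnormalized log-density minus the noise log-density. The Gaussian energy and noise law below are illustrative; when the noise distribution equals the data distribution and the model matches both, every logit is zero and the loss sits exactly at its chance value of 2 ln 2:

```python
import numpy as np

def nce_loss(energy, log_noise_pdf, data, noise):
    """Noise-contrastive estimation: classify data vs. noise samples,
    with logit(x) = -energy(x) - log_noise_pdf(x)."""
    logit = lambda x: -energy(x) - log_noise_pdf(x)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    p_data = sigmoid(logit(data))      # P(sample is real | x)
    p_noise = sigmoid(-logit(noise))   # P(sample is noise | x)
    return -(np.log(p_data).mean() + np.log(p_noise).mean())

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)       # "data" drawn from N(0, 1)
noise = rng.standard_normal(1000)      # noise chosen equal to the data law
energy = lambda x: 0.5 * x**2 + 0.5 * np.log(2 * np.pi)          # Gaussian model
log_noise_pdf = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # N(0, 1) density
print(nce_loss(energy, log_noise_pdf, data, noise))  # → 2 ln 2 ≈ 1.3863
```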
arXiv Detail & Related papers (2023-01-23T19:57:58Z)
- Image Classification with Small Datasets: Overview and Benchmark [0.0]
We systematically organize and connect past studies to consolidate a community that is currently fragmented and scattered.
We propose a common benchmark that allows for an objective comparison of approaches.
We use this benchmark to re-evaluate the standard cross-entropy baseline and ten existing methods published between 2017 and 2021 at renowned venues.
arXiv Detail & Related papers (2022-12-23T17:11:16Z)
- A Closer Look at Weakly-Supervised Audio-Visual Source Localization [26.828874753756523]
Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video.
We extend the test set of popular benchmarks, Flickr SoundNet and VGG-Sound Sources, in order to include negative samples.
We also propose a new approach for visual sound source localization that addresses both these problems.
arXiv Detail & Related papers (2022-08-30T14:17:46Z)
- A Study on Robustness to Perturbations for Representations of Environmental Sound [16.361059909912758]
We evaluate two embeddings, YAMNet and OpenL3, on monophonic (UrbanSound8K) and polyphonic (SONYC UST) datasets.
We imitate channel effects by injecting perturbations to the audio signal and measure the shift in the new embeddings with three distance measures.
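The measurement protocol just described can be sketched end-to-end; since loading YAMNet or OpenL3 is out of scope here, a fixed random projection stands in for the embedding model and additive Gaussian noise stands in for a channel effect (both are assumptions for illustration):

```python
import numpy as np

def cosine_distance(a, b):
    """One of several possible distance measures between embeddings."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
n_samples, emb_dim = 16000, 128          # 1 s of 16 kHz audio, toy embedding size
proj = rng.standard_normal((emb_dim, n_samples)) / np.sqrt(n_samples)

def embed(signal):
    """Placeholder embedding via random projection; a real study would
    run YAMNet or OpenL3 here instead."""
    return proj @ signal

clean = rng.standard_normal(n_samples)
perturbed = clean + 0.1 * rng.standard_normal(n_samples)  # injected perturbation

shift = cosine_distance(embed(clean), embed(perturbed))   # embedding shift
```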
arXiv Detail & Related papers (2022-03-20T01:04:38Z)
- Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations [54.400167806154535]
Existing research on learning with noisy labels mainly focuses on synthetic label noise.
This work presents two new benchmark datasets (CIFAR-10N, CIFAR-100N)
We show that real-world noisy labels follow an instance-dependent pattern rather than the classically adopted class-dependent ones.
arXiv Detail & Related papers (2021-10-22T22:42:11Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification [0.6767885381740952]
SoundCLR is a supervised contrastive learning method for effective environmental sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance.
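A masking-based augmentation of this kind can be sketched as SpecAugment-style frequency/time masking on a log-mel spectrogram; the mask counts and widths below are illustrative defaults, not SoundCLR's actual hyperparameters:

```python
import numpy as np

def mask_spectrogram(spec, rng, n_freq_masks=2, n_time_masks=2,
                     max_freq_width=8, max_time_width=16):
    """Zero out random frequency bands and time spans of a log-mel
    spectrogram shaped (mel_bins, frames)."""
    spec = spec.copy()                     # leave the input untouched
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, n_mels - w + 1))
        spec[f0:f0 + w, :] = 0.0           # frequency mask
    for _ in range(n_time_masks):
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, n_frames - w + 1))
        spec[:, t0:t0 + w] = 0.0           # time mask
    return spec

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((64, 128))   # toy 64-mel, 128-frame spectrogram
augmented = mask_spectrogram(log_mel, rng)
```

torchaudio's `FrequencyMasking` and `TimeMasking` transforms implement the same idea on tensors.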
arXiv Detail & Related papers (2021-03-02T18:42:45Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer with an improved beam search, reaches quality only 3.8% WER absolute worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
- ESResNet: Environmental Sound Classification Based on Visual Domain Models [4.266320191208303]
We present a model that is inherently compatible with mono and stereo sound inputs.
We investigate the influence of cross-domain pre-training, architectural changes, and evaluate our model on standard datasets.
arXiv Detail & Related papers (2020-04-15T19:07:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.