TASK3 DCASE2021 Challenge: Sound event localization and detection using
squeeze-excitation residual CNNs
- URL: http://arxiv.org/abs/2107.14561v1
- Date: Fri, 30 Jul 2021 11:34:15 GMT
- Title: TASK3 DCASE2021 Challenge: Sound event localization and detection using
squeeze-excitation residual CNNs
- Authors: Javier Naranjo-Alcazar, Sergi Perez-Castanos, Pedro Zuccarello,
Francesc J. Ferri, Maximo Cobos
- Abstract summary: This study builds on the one carried out by the same team last year.
This year, the squeeze-excitation technique is studied on each of the challenge datasets.
The modification improves the performance of the system over the baseline on the MIC dataset.
- Score: 4.4973334555746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sound event localisation and detection (SELD) is a problem in the field of
machine listening that aims at the temporal detection and localisation
(direction-of-arrival estimation) of sound events within an audio clip, usually
of long duration. Given the amount of data available in the datasets related to
this problem, deep learning solutions sit at the top of the state of the art.
Most solutions are based on 2D representations of the audio (different
spectrograms) that are processed by a convolutional-recurrent network. The
motivation of this submission is to study the squeeze-excitation technique in
the convolutional part of the network and how it improves the performance of
the system. This study builds on the one carried out by the same team last
year. This year, the study is extended to how the technique improves
performance on each of the datasets (last year only the MIC dataset was
studied). This modification improves the performance of the system over the
baseline on the MIC dataset.
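For readers unfamiliar with the technique, the following is a minimal PyTorch sketch of a squeeze-excitation residual block of the kind studied here. The abstract does not specify the submission's exact block layout, so the channel width, kernel sizes, and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    """Residual conv block with squeeze-excitation (SE) recalibration.

    A minimal sketch: the submission's exact block layout, channel
    width, and reduction ratio are assumptions, not taken from the paper.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        # Squeeze: global average pooling collapses each channel to one
        # number; excitation: a bottleneck MLP emits per-channel gates.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out * self.se(out)     # channel-wise recalibration
        return self.relu(out + x)    # residual connection

# Example: a batch of 8 spectrogram feature maps, 64 channels, 128x64 bins.
y = SEResidualBlock(64)(torch.randn(8, 64, 128, 64))
print(y.shape)  # torch.Size([8, 64, 128, 64])
```

Because the SE gates add only a small bottleneck per block and preserve output shapes, they can be dropped into an existing residual CNN without changing the rest of the network.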
Related papers
- The Solution for Temporal Sound Localisation Task of ICCV 1st Perception Test Challenge 2023 [11.64675515432159]
We employ a multimodal fusion approach to combine visual and audio features.
High-quality visual features are extracted using a state-of-the-art self-supervised pre-training network.
At the same time, audio features serve as complementary information to help the model better localize the start and end of sounds.
arXiv Detail & Related papers (2024-07-01T12:52:05Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
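To make the denoising-diffusion formulation concrete, here is a heavily simplified PyTorch sketch of one training step in that spirit. It is not DiffSED's actual model: the `BoundaryDenoiser` network, the linear noise schedule, and all shapes are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

T = 100  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class BoundaryDenoiser(nn.Module):
    """Hypothetical denoiser: maps noisy (start, end) proposals plus a
    clip embedding and the step index back to clean boundaries."""
    def __init__(self, audio_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + audio_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, noisy, audio_emb, t):
        t_feat = t.float().unsqueeze(-1) / T     # normalised step index
        return self.net(torch.cat([noisy, audio_emb, t_feat], dim=-1))

model = BoundaryDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One toy training step on random stand-in data.
gt = torch.rand(16, 2)            # ground-truth (start, end) in [0, 1]
audio_emb = torch.randn(16, 128)  # stand-in clip embeddings
t = torch.randint(0, T, (16,))
noise = torch.randn_like(gt)
a = alphas_bar[t].unsqueeze(-1)
noisy = a.sqrt() * gt + (1 - a).sqrt() * noise  # forward noising
loss = nn.functional.mse_loss(model(noisy, audio_emb, t), gt)
opt.zero_grad(); loss.backward(); opt.step()
```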
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine to fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
- BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387]
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features.
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
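As an illustration of what a hybrid handcrafted-plus-learned representation can look like, here is a minimal PyTorch sketch. The feature choices (per-band statistics, a toy encoder) and dimensions are assumptions for illustration, not BYOL-S's actual pipeline.

```python
import torch
import torch.nn as nn

# Stand-in input: a batch of 8 log-mel spectrograms (64 bands, 400 frames).
mel = torch.randn(8, 64, 400)

# Handcrafted view: simple per-band summary statistics (mean and std).
handcrafted = torch.cat([mel.mean(dim=2), mel.std(dim=2)], dim=1)  # (8, 128)

# Learned view: a toy encoder standing in for the bootstrapped network.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 400, 256), nn.ReLU())
learned = encoder(mel)                                             # (8, 256)

# Hybrid representation: concatenate the two views into one feature vector.
hybrid = torch.cat([handcrafted, learned], dim=1)                  # (8, 384)
print(hybrid.shape)
```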
arXiv Detail & Related papers (2022-06-24T02:26:40Z)
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound proposal event map is then constructed to handle the challenges of predicting events with large varying temporal duration.
arXiv Detail & Related papers (2021-06-13T11:43:41Z)
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
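To illustrate how self-attention can capture temporal dependencies across the frames of a clip, here is a minimal PyTorch sketch. It is not PILOT's architecture; the feature dimension, head count, layer count, and output heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Per-frame features from multi-channel audio (stand-in values):
# a batch of 4 clips, 200 time frames, 64-dim features per frame.
frames = torch.randn(4, 200, 64)

encoder_layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Self-attention lets every frame attend to every other frame, so
# long-range temporal dependencies are captured without recurrence.
context = encoder(frames)                              # (4, 200, 64)

# Per-frame output heads: event activity and a 3D DOA vector.
activity = torch.sigmoid(nn.Linear(64, 1)(context))    # (4, 200, 1)
doa = torch.tanh(nn.Linear(64, 3)(context))            # (4, 200, 3)
print(activity.shape, doa.shape)
```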
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
- Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures [23.568610919253352]
This paper proposes a semi-supervised method for generating pseudo-labels from unsupervised data using a student-teacher scheme that balances self-training and cross-training.
The results of these methods on both "validation" and "public evaluation" sets of the DESED database show significant improvement compared to state-of-the-art semi-supervised learning systems.
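For context, here is a minimal PyTorch sketch of the generic student-teacher pseudo-labelling pattern such methods build on: an EMA teacher labels unlabeled clips, and confident predictions train the student. The EMA decay, confidence threshold, and toy classifier are assumptions, not the paper's exact cross-referencing scheme.

```python
import copy
import torch
import torch.nn as nn

student = nn.Linear(64, 10)          # stand-in multi-label SED classifier
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def ema_update(decay: float = 0.999):
    # The teacher tracks an exponential moving average of the student.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps.detach(), alpha=1 - decay)

unlabeled = torch.randn(32, 64)      # stand-in features of unlabeled clips
with torch.no_grad():
    probs = torch.sigmoid(teacher(unlabeled))
confident = probs.max(dim=1).values > 0.9    # keep confident clips only
pseudo = (probs > 0.5).float()               # binarised pseudo-labels

if confident.any():
    loss = nn.functional.binary_cross_entropy_with_logits(
        student(unlabeled[confident]), pseudo[confident])
    opt.zero_grad(); loss.backward(); opt.step()
ema_update()
```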
arXiv Detail & Related papers (2021-05-27T18:46:59Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
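To make the oracle principle concrete: an ideal ratio mask is built from the true source and residual spectrograms, and applying it to the mixture yields an upper-bound separation estimate with no training. A minimal PyTorch sketch under one common IRM definition (the shapes and the proxy score are illustrative):

```python
import torch

# Stand-in magnitude spectrograms: 513 freq bins x 200 time frames.
S = torch.rand(513, 200)   # |target instrument|
N = torch.rand(513, 200)   # |residual: all other sources|
eps = 1e-8

# Ideal ratio mask: the fraction of each time-frequency bin's energy
# that belongs to the target (one common IRM definition).
irm = S / (S + N + eps)

# Masking the mixture gives an oracle (upper-bound) estimate of the
# target, with no network training involved.
est = irm * (S + N)

# Simple signal-to-distortion-style proxy score in dB.
err = est - S
sdr = 10 * torch.log10((S.pow(2).sum() + eps) / (err.pow(2).sum() + eps))
print(f"oracle proxy SDR: {sdr.item():.1f} dB")
```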
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to those of the source-domain training dataset.
We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
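The alignment step itself is simple enough to sketch. Below is a minimal PyTorch version of the idea: standardise each frequency band of the target data, then re-scale it to the source-domain band statistics. The data shapes and stand-in values are assumptions; the paper's exact normalisation details are not reproduced.

```python
import torch

# Stand-in log-mel batches: (clips, mel bands, time frames). The target
# domain is given deliberately mismatched statistics.
source = torch.randn(100, 64, 400) * 2.0 + 1.0
target = torch.randn(100, 64, 400) * 0.5 - 3.0

# Per-band first- and second-order statistics over clips and frames.
src_mu = source.mean(dim=(0, 2), keepdim=True)
src_sd = source.std(dim=(0, 2), keepdim=True)
tgt_mu = target.mean(dim=(0, 2), keepdim=True)
tgt_sd = target.std(dim=(0, 2), keepdim=True)

# Standardise each target band, then re-scale to the source statistics.
adapted = (target - tgt_mu) / (tgt_sd + 1e-8) * src_sd + src_mu

# Each adapted band now matches the source band's mean and std.
print(adapted.mean(dim=(0, 2))[:3], source.mean(dim=(0, 2))[:3])
```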
arXiv Detail & Related papers (2020-04-30T23:56:05Z)
- CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
There are approximately 3M people with hearing loss who cannot perceive events happening around them.
This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.