Improving weakly supervised sound event detection with self-supervised
auxiliary tasks
- URL: http://arxiv.org/abs/2106.06858v1
- Date: Sat, 12 Jun 2021 20:28:22 GMT
- Authors: Soham Deshmukh, Bhiksha Raj, Rita Singh
- Abstract summary: We propose a shared encoder architecture with sound event detection as a primary task and an additional secondary decoder for a self-supervised auxiliary task.
We empirically evaluate the proposed framework for weakly supervised sound event detection on a remix dataset of the DCASE 2019 Task 1 acoustic scene data.
The proposed framework with two-step attention outperforms existing benchmark models by 22.3%, 12.8%, and 5.9% at 0, 10, and 20 dB SNR, respectively.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While multitask and transfer learning have been shown to improve the
performance of neural networks in limited-data settings, they require
pretraining the model on large datasets beforehand. In this paper, we focus on
simultaneously improving the performance of weakly supervised sound event
detection in low-data and noisy settings without requiring any pretraining
task. To that end,
we propose a shared encoder architecture with sound event detection as a
primary task and an additional secondary decoder for a self-supervised
auxiliary task. We empirically evaluate the proposed framework for weakly
supervised sound event detection on a remix dataset of the DCASE 2019 Task 1
acoustic scene data with DCASE 2018 Task 2 sound event data at 0, 10, and 20
dB SNR. To ensure we retain the localisation information of multiple sound
events, we propose a two-step attention pooling mechanism that provides a
time-frequency localisation of multiple audio events in the clip. The proposed
framework with two-step attention outperforms existing benchmark models by
22.3%, 12.8%, and 5.9% at 0, 10, and 20 dB SNR, respectively. We carry out an
ablation study to determine the contribution of the auxiliary task and two-step
attention pooling to the SED performance improvement.
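The two-step attention pooling described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the idea, not the paper's implementation: the tensor shapes, the randomly initialised weight matrices `Wf`, `Wt`, and `Wc`, and the exact form of each attention step are all assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

T, F, C = 100, 64, 10              # frames, frequency bins, event classes
X = rng.standard_normal((T, F))    # stand-in for the shared-encoder output

# Hypothetical class-wise attention parameters (randomly initialised here;
# in the real model these would be learned jointly with the encoder)
Wf = rng.standard_normal((C, F))   # step 1: frequency attention
Wt = rng.standard_normal((C, F))   # step 2: time attention
Wc = rng.standard_normal((C, F))   # class-evidence projection

# Step 1: attend over frequency -> per-frame, per-class evidence (C, T)
A_freq = softmax(X[None, :, :] * Wf[:, None, :], axis=2)          # (C, T, F)
frame_scores = (A_freq * (X[None, :, :] * Wc[:, None, :])).sum(axis=2)

# Step 2: attend over time -> clip-level class probabilities (C,)
A_time = softmax((X[None, :, :] * Wt[:, None, :]).sum(axis=2), axis=1)
clip_probs = sigmoid((A_time * frame_scores).sum(axis=1))

# frame_scores localises events in time and A_freq in frequency, giving the
# time-frequency localisation of multiple events the abstract refers to.
```

Under weak supervision, only `clip_probs` would be trained against the clip-level labels; `frame_scores` and `A_freq` then provide the time-frequency localisation at inference.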
Related papers
- Pretraining Representations for Bioacoustic Few-shot Detection using
Supervised Contrastive Learning [10.395255631261458]
In bioacoustic applications, most tasks come with few labelled training examples, because annotating long recordings is time-consuming and costly.
We show that learning a rich feature extractor from scratch can be achieved by leveraging data augmentation using a supervised contrastive learning framework.
We obtain an F-score of 63.46% on the validation set and 42.7% on the test set, ranking second in the DCASE challenge.
arXiv Detail & Related papers (2023-09-02T09:38:55Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - Robust, General, and Low Complexity Acoustic Scene Classification
Systems and An Effective Visualization for Presenting a Sound Scene Context [53.80051967863102]
We present a comprehensive analysis of Acoustic Scene Classification (ASC).
We propose an inception-based and low footprint ASC model, referred to as the ASC baseline.
Next, we improve the ASC baseline by proposing a novel deep neural network architecture.
arXiv Detail & Related papers (2022-10-16T19:07:21Z) - Wider or Deeper Neural Network Architecture for Acoustic Scene
Classification with Mismatched Recording Devices [59.86658316440461]
We present a robust and low-complexity system for Acoustic Scene Classification (ASC).
We first construct an ASC baseline system in which a novel inception-residual-based network architecture is proposed to deal with the mismatched recording device issue.
To further improve performance while keeping model complexity low, we apply two techniques: an ensemble of multiple spectrograms and channel reduction.
arXiv Detail & Related papers (2022-03-23T10:27:41Z) - A benchmark of state-of-the-art sound event detection systems evaluated
on synthetic soundscapes [10.512055210540668]
We study the solutions proposed by participants to analyze their robustness to varying target-to-non-target signal-to-noise ratios and to the temporal localization of target sound events.
Results show that systems tend to spuriously predict short events when non-target events are present.
arXiv Detail & Related papers (2022-02-03T09:41:31Z) - Proposal-based Few-shot Sound Event Detection for Speech and
Environmental Sounds with Perceivers [0.7776497736451751]
We propose a region proposal-based approach to few-shot sound event detection utilizing the Perceiver architecture.
Motivated by a lack of suitable benchmark datasets, we generate two new few-shot sound event localization datasets.
arXiv Detail & Related papers (2021-07-28T19:46:55Z) - Cross-Referencing Self-Training Network for Sound Event Detection in
Audio Mixtures [23.568610919253352]
This paper proposes a semi-supervised method for generating pseudo-labels from unsupervised data using a student-teacher scheme that balances self-training and cross-training.
The results of these methods on both the "validation" and "public evaluation" sets of the DESED database show significant improvement compared to state-of-the-art systems in semi-supervised learning.
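The student-teacher pseudo-labelling scheme this entry describes can be sketched roughly as follows. This is a toy NumPy illustration under assumptions of mine (a linear classifier, a 0.9/0.1 confidence threshold, and an EMA teacher in the style of mean-teacher methods), not the paper's actual cross-referencing system:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, C = 32, 10                                  # feature dim, event classes
student = rng.standard_normal((D, C)) * 0.5    # linear "student" classifier
teacher = student.copy()                       # teacher initialised from student

X_unlab = rng.standard_normal((200, D))        # unlabelled audio features

for step in range(10):
    # Teacher produces pseudo-labels; only confident predictions are kept
    t_probs = sigmoid(X_unlab @ teacher)
    confident = (t_probs > 0.9) | (t_probs < 0.1)
    pseudo = (t_probs > 0.5).astype(float)

    # Student trains on confident pseudo-labels (one SGD step on masked BCE;
    # the BCE gradient w.r.t. the logits is simply probs - labels)
    s_probs = sigmoid(X_unlab @ student)
    grad = X_unlab.T @ ((s_probs - pseudo) * confident) / len(X_unlab)
    student -= 0.5 * grad

    # Teacher tracks an exponential moving average of the student weights
    teacher = 0.99 * teacher + 0.01 * student
```

In practice the student and teacher would be deep SED networks trained on a mix of labelled and unlabelled clips; the sketch only shows the self-training loop that generates and consumes the pseudo-labels.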
arXiv Detail & Related papers (2021-05-27T18:46:59Z) - Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing false alarms in noisy conditions.
arXiv Detail & Related papers (2021-01-29T18:44:05Z) - Multi-Task Learning for Interpretable Weakly Labelled Sound Event
Detection [34.99472489405047]
This paper proposes a Multi-Task Learning framework for learning from Weakly Labelled Audio data.
We show that the chosen auxiliary task de-noises internal T-F representation and improves SED performance under noisy recordings.
The proposed framework outperforms existing benchmark models across all SNRs.
arXiv Detail & Related papers (2020-08-17T04:46:25Z) - Device-Robust Acoustic Scene Classification Based on Two-Stage
Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.