Proposal-based Few-shot Sound Event Detection for Speech and Environmental Sounds with Perceivers
- URL: http://arxiv.org/abs/2107.13616v2
- Date: Sat, 23 Dec 2023 18:34:14 GMT
- Title: Proposal-based Few-shot Sound Event Detection for Speech and Environmental Sounds with Perceivers
- Authors: Piper Wolters, Logan Sizemore, Chris Daw, Brian Hutchinson, Lauren Phillips
- Abstract summary: We propose a region proposal-based approach to few-shot sound event detection utilizing the Perceiver architecture.
Motivated by a lack of suitable benchmark datasets, we generate two new few-shot sound event localization datasets.
- Score: 0.7776497736451751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many applications involve detecting and localizing specific sound events
within long, untrimmed audio recordings, including keyword spotting, medical
observation, and bioacoustic monitoring for conservation. Deep learning
techniques often set the state-of-the-art for these tasks. However, for some
types of events, there is insufficient labeled data to train such models. In
this paper, we propose a region proposal-based approach to few-shot sound event
detection utilizing the Perceiver architecture. Motivated by a lack of suitable
benchmark datasets, we generate two new few-shot sound event localization
datasets: "Vox-CASE," using clips of celebrity speech as the sound event, and
"ESC-CASE," using environmental sound events. Our highest performing proposed
few-shot approaches achieve F1-scores of 0.483 and 0.418, respectively, on 5-shot
5-way tasks on these two datasets. These represent relative improvements of
72.5% and 11.2% over strong proposal-free few-shot sound event detection
baselines.
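A minimal sketch of the episodic setup described above, assuming a Perceiver-style encoder (a small learned latent array cross-attending to audio frames) that embeds 5-shot 5-way support clips and candidate region proposals, with proposals labeled by nearest class prototype. The module and variable names are illustrative assumptions, not the authors' implementation, and the proposal stage itself is taken as given:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerceiverEncoder(nn.Module):
    """Compress a variable-length frame sequence into a fixed clip embedding."""
    def __init__(self, feat_dim=64, latent_dim=128, n_latents=8, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, latent_dim))
        self.proj = nn.Linear(feat_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.GELU(),
                                nn.Linear(latent_dim, latent_dim))

    def forward(self, frames):                     # frames: (B, T, feat_dim)
        x = self.proj(frames)                      # (B, T, latent_dim)
        q = self.latents.expand(frames.size(0), -1, -1)
        z, _ = self.cross_attn(q, x, x)            # latents attend to frames
        z = z + self.ff(z)
        return z.mean(dim=1)                       # (B, latent_dim)

def classify_proposals(encoder, support, support_y, proposals, n_way=5):
    """Nearest-prototype labels for region proposals in a 5-shot 5-way episode."""
    s = encoder(support)                           # (n_way * k_shot, D)
    prototypes = torch.stack([s[support_y == c].mean(0) for c in range(n_way)])
    p = encoder(proposals)                         # (n_proposals, D)
    logits = -torch.cdist(p, prototypes)           # negative Euclidean distance
    return logits.argmax(dim=1)

# Toy 5-shot 5-way episode on random log-mel-like features (T=100 frames).
enc = PerceiverEncoder()
support = torch.randn(25, 100, 64)                 # 5 classes x 5 shots
support_y = torch.arange(5).repeat_interleave(5)
proposals = torch.randn(10, 100, 64)               # regions from a proposal stage
print(classify_proposals(enc, support, support_y, proposals))
```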
Related papers
- Multitask frame-level learning for few-shot sound event detection [46.32294691870714]
This paper focuses on few-shot Sound Event Detection (SED), which aims to automatically recognize and classify sound events with limited samples.
We introduce an innovative multitask frame-level SED framework and TimeFilterAug, a linear timing mask for data augmentation.
The proposed method achieves an F-score of 63.8%, securing first place in the few-shot bioacoustic event detection category.
arXiv Detail & Related papers (2024-03-17T05:00:40Z) - Pretraining Representations for Bioacoustic Few-shot Detection using
- Pretraining Representations for Bioacoustic Few-shot Detection using Supervised Contrastive Learning [10.395255631261458]
In bioacoustic applications, most tasks come with few labelled training data, because annotating long recordings is time consuming and costly.
We show that a rich feature extractor can be learned from scratch by leveraging data augmentation within a supervised contrastive learning framework.
We obtain an F-score of 63.46% on the validation set and 42.7% on the test set, ranking second in the DCASE challenge.
arXiv Detail & Related papers (2023-09-02T09:38:55Z) - AGS: An Dataset and Taxonomy for Domestic Scene Sound Event Recognition [1.5106201893222209]
- AGS: An Dataset and Taxonomy for Domestic Scene Sound Event Recognition [1.5106201893222209]
This paper proposes a dataset (called AGS) for domestic environment sounds.
The dataset covers various types of overlapping audio within each scene, as well as background noise.
arXiv Detail & Related papers (2023-08-30T03:03:47Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - Segment-level Metric Learning for Few-shot Bioacoustic Event Detection [56.59107110017436]
- Segment-level Metric Learning for Few-shot Bioacoustic Event Detection [56.59107110017436]
We propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model optimization.
Our system achieves an F-measure of 62.73 on the DCASE 2022 challenge task 5 (DCASE2022-T5) validation set, outperforming the baseline prototypical network (F-measure 34.02) by a large margin.
arXiv Detail & Related papers (2022-07-15T22:41:30Z) - A benchmark of state-of-the-art sound event detection systems evaluated
- A benchmark of state-of-the-art sound event detection systems evaluated on synthetic soundscapes [10.512055210540668]
We study the solutions proposed by participants, analyzing their robustness to varying target-to-non-target signal-to-noise ratios and to the temporal localization of target sound events.
Results show that systems tend to spuriously predict short events when non-target events are present.
arXiv Detail & Related papers (2022-02-03T09:41:31Z) - SoundDet: Polyphonic Sound Event Detection and Localization from Raw
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound proposal event map is then constructed to handle the challenges of predicting events with large varying temporal duration.
arXiv Detail & Related papers (2021-06-13T11:43:41Z) - Improving weakly supervised sound event detection with self-supervised
- Improving weakly supervised sound event detection with self-supervised auxiliary tasks [33.427215114252235]
We propose a shared encoder architecture with sound event detection as a primary task and an additional secondary decoder for a self-supervised auxiliary task.
We empirically evaluate the proposed framework for weakly supervised sound event detection on a remix dataset of the DCASE 2019 task 1 acoustic scene data.
The proposed framework with two-step attention outperforms existing benchmark models by 22.3%, 12.8%, and 5.9% at 0, 10, and 20 dB SNR, respectively.
arXiv Detail & Related papers (2021-06-12T20:28:22Z) - PILOT: Introducing Transformers for Probabilistic Sound Event
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z) - Exploiting Attention-based Sequence-to-Sequence Architectures for Sound
- Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization [113.19483349876668]
This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model.
It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
arXiv Detail & Related papers (2021-02-28T07:52:20Z)