Active Learning for Sound Event Detection
- URL: http://arxiv.org/abs/2002.05033v2
- Date: Wed, 9 Sep 2020 14:49:55 GMT
- Title: Active Learning for Sound Event Detection
- Authors: Shuyang Zhao, Toni Heittola, Tuomas Virtanen
- Abstract summary: This paper proposes an active learning system for sound event detection (SED).
It aims at maximizing the accuracy of a learned SED model with limited annotation effort.
Remarkably, the required annotation effort can be greatly reduced on the dataset where target sound events are rare.
- Score: 18.750572243562576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an active learning system for sound event detection
(SED). It aims at maximizing the accuracy of a learned SED model with limited
annotation effort. The proposed system analyzes an initially unlabeled audio
dataset, from which it selects sound segments for manual annotation. The
candidate segments are generated based on a proposed change point detection
approach, and the selection is based on the principle of mismatch-first
farthest-traversal. During the training of SED models, recordings are used as
training inputs, preserving the long-term context for annotated segments. The
proposed system clearly outperforms reference methods in the two datasets used
for evaluation (TUT Rare Sound 2017 and TAU Spatial Sound 2019). Training with
recordings as context outperforms training with only annotated segments.
Mismatch-first farthest-traversal outperforms reference sample selection
methods based on random sampling and uncertainty sampling. Remarkably, the
required annotation effort can be greatly reduced on the dataset where target
sound events are rare: by annotating only 2% of the training data, the achieved
SED performance is similar to annotating all the training data.
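The mismatch-first farthest-traversal selection principle can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the two-predictor setup (e.g. comparing the current SED model's segment predictions against a reference labeler), and the Euclidean distance on precomputed segment features are all assumptions.

```python
import numpy as np

def farthest_traversal(features, candidates, selected, min_dist, k):
    """Pick up to k indices from `candidates`, each time taking the
    segment farthest from everything selected so far (farthest-first)."""
    picks = []
    for _ in range(min(k, len(candidates))):
        remaining = [c for c in candidates
                     if c not in picks and c not in selected]
        if not remaining:
            break
        # Farthest-first: largest distance to the nearest selected segment.
        best = max(remaining, key=lambda i: min_dist[i])
        picks.append(best)
        # Update nearest-selected distances with the new pick.
        d = np.linalg.norm(features - features[best], axis=1)
        np.minimum(min_dist, d, out=min_dist)
    return picks

def mismatch_first_farthest_traversal(features, pred_a, pred_b, budget):
    """Select `budget` segment indices for annotation.

    Segments where the two predictors disagree (the mismatched set) are
    exhausted first; any leftover budget is spent on matched segments.
    """
    n = len(features)
    min_dist = np.full(n, np.inf)
    mismatched = [i for i in range(n) if pred_a[i] != pred_b[i]]
    matched = [i for i in range(n) if pred_a[i] == pred_b[i]]
    selected = farthest_traversal(features, mismatched, [], min_dist, budget)
    if len(selected) < budget:
        selected += farthest_traversal(features, matched, selected, min_dist,
                                       budget - len(selected))
    return selected
```

The two-phase structure is the point: disagreement between predictors flags informative segments, while farthest-first traversal keeps the chosen batch diverse in feature space.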
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
- From Weak to Strong Sound Event Labels using Adaptive Change-Point Detection and Active Learning [11.312115846980602]
We propose an adaptive change point detection method (A-CPD) for machine guided weak label annotation of audio recording segments.
For each unlabeled audio recording, we use a prediction model to derive a probability curve used to guide annotation.
We show that it is possible to derive strong labels of high quality with a limited annotation budget, and show favorable results for A-CPD.
arXiv Detail & Related papers (2024-03-13T13:33:35Z)
- Combating Label Noise With A General Surrogate Model For Sample Selection [84.61367781175984]
We propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically.
We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets.
arXiv Detail & Related papers (2023-10-16T14:43:27Z)
- Pretraining Representations for Bioacoustic Few-shot Detection using Supervised Contrastive Learning [10.395255631261458]
In bioacoustic applications, most tasks come with few labelled training data, because annotating long recordings is time consuming and costly.
We show that learning a rich feature extractor from scratch can be achieved by leveraging data augmentation using a supervised contrastive learning framework.
We obtain an F-score of 63.46% on the validation set and 42.7% on the test set, ranking second in the DCASE challenge.
arXiv Detail & Related papers (2023-09-02T09:38:55Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Segment-level Metric Learning for Few-shot Bioacoustic Event Detection [56.59107110017436]
We propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model optimization.
Our system achieves an F-measure of 62.73 on the DCASE 2022 challenge task 5 (DCASE2022-T5) validation set, outperforming the baseline prototypical network (F-measure 34.02) by a large margin.
arXiv Detail & Related papers (2022-07-15T22:41:30Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
- A benchmark of state-of-the-art sound event detection systems evaluated on synthetic soundscapes [10.512055210540668]
We study the solutions proposed by participants to analyze their robustness to varying target-to-non-target signal-to-noise ratios and to the temporal localization of target sound events.
Results show that systems tend to spuriously predict short events when non-target events are present.
arXiv Detail & Related papers (2022-02-03T09:41:31Z)
- Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures [23.568610919253352]
This paper proposes a semi-supervised method for generating pseudo-labels from unsupervised data using a student-teacher scheme that balances self-training and cross-training.
The results of these methods on both the "validation" and "public evaluation" sets of the DESED database show significant improvement over state-of-the-art semi-supervised learning systems.
arXiv Detail & Related papers (2021-05-27T18:46:59Z)
- Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset.
We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
arXiv Detail & Related papers (2020-04-30T23:56:05Z)
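The band-wise statistics matching described in the last entry can be sketched as a per-band standardize-and-rescale step. This is a hedged illustration only: the array shapes, names, and the choice to pool statistics over clips and frames are assumptions, not details taken from the paper.

```python
import numpy as np

def bandwise_match(target_specs, source_mean, source_std, eps=1e-8):
    """Align per-band first- and second-order statistics of
    target-domain spectrograms to source-domain statistics.

    target_specs: array (n_clips, n_bands, n_frames), e.g. log-mel features.
    source_mean, source_std: per-band statistics (n_bands,) computed
    on the source-domain training set.
    """
    # Target statistics per frequency band, pooled over clips and frames.
    t_mean = target_specs.mean(axis=(0, 2), keepdims=True)
    t_std = target_specs.std(axis=(0, 2), keepdims=True)
    # Standardize each target band, then rescale to the source statistics.
    z = (target_specs - t_mean) / (t_std + eps)
    return z * source_std[None, :, None] + source_mean[None, :, None]
```

After this transform, each frequency band of the target data has (approximately) the source domain's mean and standard deviation, which is the first- and second-order alignment the abstract describes.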
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.