Robust detection of overlapping bioacoustic sound events
- URL: http://arxiv.org/abs/2503.02389v1
- Date: Tue, 04 Mar 2025 08:26:03 GMT
- Title: Robust detection of overlapping bioacoustic sound events
- Authors: Louis Mahon, Benjamin Hoffman, Logan S James, Maddie Cusimano, Masato Hagiwara, Sarah C Woolley, Olivier Pietquin
- Abstract summary: We introduce an onset-based detection method which we name Voxaboxen. For each time window, Voxaboxen predicts whether it contains the start of a vocalization and how long the vocalization is. We release a new dataset designed to measure performance on detecting overlapping vocalizations.
- Score: 16.976684123806653
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a method for accurately detecting bioacoustic sound events that is robust to overlapping events, a common issue in domains such as ethology, ecology and conservation. While standard methods employ a frame-based, multi-label approach, we introduce an onset-based detection method which we name Voxaboxen. It takes inspiration from object detection methods in computer vision, but simultaneously takes advantage of recent advances in self-supervised audio encoders. For each time window, Voxaboxen predicts whether it contains the start of a vocalization and how long the vocalization is. It also does the same in reverse, predicting whether each window contains the end of a vocalization, and how long ago it started. The two resulting sets of bounding boxes are then fused using a graph-matching algorithm. We also release a new dataset designed to measure performance on detecting overlapping vocalizations. This consists of recordings of zebra finches annotated with temporally-strong labels and showing frequent overlaps. We test Voxaboxen on seven existing data sets and on our new data set. We compare Voxaboxen to natural baselines and existing sound event detection methods and demonstrate SotA results. Further experiments show that improvements are robust to frequent vocalization overlap.
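The bidirectional prediction-and-fusion step described in the abstract can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: Voxaboxen fuses the forward (onset-anchored) and backward (offset-anchored) bounding boxes with a graph-matching algorithm, which is approximated here by greedy IoU matching; the names `iou` and `fuse_boxes` and the choice to average matched box pairs are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two 1-D time intervals (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def fuse_boxes(fwd_boxes, bwd_boxes, iou_thresh=0.5):
    """Greedily pair onset-derived and offset-derived boxes by IoU.

    Matched pairs are averaged; unmatched boxes from either pass are kept,
    so vocalizations detected by only one direction still survive.
    """
    pairs = sorted(
        ((iou(f, b), i, j)
         for i, f in enumerate(fwd_boxes)
         for j, b in enumerate(bwd_boxes)),
        reverse=True,
    )
    fused, used_f, used_b = [], set(), set()
    for score, i, j in pairs:
        if score < iou_thresh:
            break  # scores are sorted descending; nothing further can match
        if i in used_f or j in used_b:
            continue
        f, b = fwd_boxes[i], bwd_boxes[j]
        fused.append(((f[0] + b[0]) / 2, (f[1] + b[1]) / 2))
        used_f.add(i)
        used_b.add(j)
    fused += [f for i, f in enumerate(fwd_boxes) if i not in used_f]
    fused += [b for j, b in enumerate(bwd_boxes) if j not in used_b]
    return fused
```

Because each pass anchors boxes independently, overlapping vocalizations that share a frame can still produce distinct boxes, which is the property the frame-based multi-label baseline lacks.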
Related papers
- Unsupervised outlier detection to improve bird audio dataset labels [0.0]
Non-target bird species sounds can result in dataset labeling discrepancies referred to as label noise.
We present a cleaning process consisting of audio preprocessing followed by dimensionality reduction and unsupervised outlier detection.
arXiv Detail & Related papers (2025-04-25T19:04:40Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV)
We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discrimination between genuine and adversarial samples.
Our code will be made open-source to enable comparison in future work.
arXiv Detail & Related papers (2021-07-01T08:58:16Z)
- Influential Rank: A New Perspective of Post-training for Robust Model against Noisy Labels [23.80449026013167]
We propose a new approach for learning from noisy labels (LNL) via post-training.
We exploit the overfitting property of a trained model to identify mislabeled samples.
Our post-training approach creates great synergies when combined with the existing LNL methods.
arXiv Detail & Related papers (2021-06-14T08:04:18Z)
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound proposal event map is then constructed to handle the challenges of predicting events with large varying temporal duration.
arXiv Detail & Related papers (2021-06-13T11:43:41Z)
- Automatic audiovisual synchronisation for ultrasound tongue imaging [35.60751372748571]
Ultrasound and speech audio are recorded simultaneously, and in order to correctly use this data, the two modalities should be correctly synchronised.
Synchronisation is achieved using specialised hardware at recording time, but this approach can fail in practice, resulting in data of limited usability.
In this paper, we address the problem of automatically synchronising ultrasound and audio after data collection.
We describe our approach for automatic synchronisation, which is driven by a self-supervised neural network, exploiting the correlation between the two signals to synchronise them.
arXiv Detail & Related papers (2021-05-31T17:11:28Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z) - Unsupervised Classification of Voiced Speech and Pitch Tracking Using
Forward-Backward Kalman Filtering [14.950964357181524]
We present a new algorithm that integrates the three subtasks into a single procedure.
The algorithm can be applied to pre-recorded speech utterances in the presence of considerable amounts of background noise.
arXiv Detail & Related papers (2021-03-01T18:13:23Z)
- Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset.
We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
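The band-wise statistics alignment described above can be sketched concisely. A minimal NumPy sketch, assuming time-frequency features shaped (frames, bands); the function name `band_stats_match` and the standardize-then-rescale formulation are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def band_stats_match(target_feats, source_mean, source_std, eps=1e-8):
    """Align each frequency band's first- and second-order statistics.

    target_feats: (n_frames, n_bands) features from the target domain.
    source_mean, source_std: (n_bands,) statistics computed on the
    source-domain training set. Each target band is standardized and
    then rescaled to match the corresponding source band.
    """
    t_mean = target_feats.mean(axis=0)
    t_std = target_feats.std(axis=0)
    standardized = (target_feats - t_mean) / (t_std + eps)
    return standardized * source_std + source_mean
```

Operating per band (rather than on the whole feature vector) lets the method compensate for frequency-dependent channel differences between recording devices, which is the mismatch this kind of adaptation targets.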
arXiv Detail & Related papers (2020-04-30T23:56:05Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.