SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform
- URL: http://arxiv.org/abs/2106.06969v1
- Date: Sun, 13 Jun 2021 11:43:41 GMT
- Title: SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform
- Authors: Yuhang He, Niki Trigoni, Andrew Markham
- Abstract summary: SoundDet is an end-to-end trainable and lightweight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound event proposal map is then constructed to handle the challenge of predicting events with widely varying temporal durations.
- Score: 48.68714598985078
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present SoundDet, an end-to-end trainable and lightweight framework for polyphonic moving sound event detection and localization. Prior methods typically preprocess the raw waveform into time-frequency representations, which are more amenable to well-established image processing pipelines, and they detect in a segment-wise manner, leading to incomplete, partial detections. SoundDet takes a novel approach: it directly consumes the raw, multichannel waveform and treats the spatio-temporal sound event as a complete "sound-object" to be detected. Specifically, SoundDet consists of a backbone neural network and two parallel heads for temporal detection and spatial localization, respectively. Given the high sampling rate of the raw waveform, the backbone network first learns a bank of phase-sensitive, frequency-selective filters that explicitly retain direction-of-arrival information while being far more computationally and parametrically efficient than standard 1D/2D convolution. A dense sound event proposal map is then constructed to handle the challenge of predicting events with widely varying temporal durations. Accompanying the dense proposal map are a temporal overlapness map and a motion smoothness map, which measure a proposal's confidence of being an event from the perspectives of temporal detection accuracy and movement consistency. Together, the two maps allow SoundDet to be trained in a spatio-temporally unified manner. Experimental results on the public DCASE dataset show the advantage of SoundDet under both segment-based and our newly proposed event-based evaluation.
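To make the architecture concrete, here is a minimal, hypothetical PyTorch sketch of a SoundDet-style pipeline, not the authors' implementation: a learnable band-pass filterbank (a SincNet-style stand-in for the paper's phase-sensitive, frequency-selective filters) feeding a small backbone with parallel temporal-detection and DoA heads. All layer sizes, class counts, and names are illustrative assumptions, and the dense proposal, overlapness, and smoothness maps are omitted.

```python
import torch
import torch.nn as nn

N_CHANNELS, N_FILTERS, N_CLASSES = 4, 64, 14   # assumed: 4-mic array, DCASE-style class count
KERNEL, SAMPLE_RATE = 251, 24000               # assumed filter length and sample rate


class LearnableFilterbank(nn.Module):
    """Band-pass filters parameterized by learnable cutoffs (a SincNet-style
    stand-in). Two scalars per filter, instead of KERNEL free weights per
    filter, keeps the layer parametrically cheap; applying the same filters to
    every channel preserves inter-channel phase (hence DoA) cues."""

    def __init__(self):
        super().__init__()
        self.low = nn.Parameter(torch.linspace(30.0, 8000.0, N_FILTERS))
        self.band = nn.Parameter(torch.full((N_FILTERS,), 100.0))
        t = torch.arange(-(KERNEL // 2), KERNEL // 2 + 1) / SAMPLE_RATE
        # Nudge t == 0 to avoid division by zero in the sinc below.
        self.register_buffer("t", torch.where(t == 0, torch.full_like(t, 1e-9), t))
        self.register_buffer("window", torch.hamming_window(KERNEL, periodic=False))

    def forward(self, x):                            # x: (B, N_CHANNELS, T)
        low = self.low.abs().unsqueeze(1)            # (F, 1) lower cutoffs in Hz
        high = low + self.band.abs().unsqueeze(1)    # (F, 1) upper cutoffs in Hz
        sinc = lambda f: torch.sin(2 * torch.pi * f * self.t) / (torch.pi * self.t)
        # Ideal band-pass impulse response = difference of two low-pass sincs.
        kernels = ((sinc(high) - sinc(low)) * self.window).unsqueeze(1)  # (F, 1, K)
        b, c, n = x.shape
        y = nn.functional.conv1d(x.reshape(b * c, 1, n), kernels, padding=KERNEL // 2)
        return y.reshape(b, c * N_FILTERS, n)        # per-channel filter outputs


class SoundDetSketch(nn.Module):
    """Backbone plus two parallel heads: per-frame event logits (temporal
    detection) and per-class DoA vectors (spatial localization)."""

    def __init__(self):
        super().__init__()
        self.filterbank = LearnableFilterbank()
        self.backbone = nn.Sequential(               # downsample waveform-rate features
            nn.Conv1d(N_CHANNELS * N_FILTERS, 128, 9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(128, 128, 9, stride=4, padding=4), nn.ReLU(),
        )
        self.event_head = nn.Conv1d(128, N_CLASSES, 1)     # temporal detection
        self.doa_head = nn.Conv1d(128, 3 * N_CLASSES, 1)   # (x, y, z) per class

    def forward(self, wav):                          # wav: (B, N_CHANNELS, T)
        h = self.backbone(self.filterbank(wav))
        return self.event_head(h), self.doa_head(h)


model = SoundDetSketch()
event_logits, doa = model(torch.randn(2, N_CHANNELS, SAMPLE_RATE))  # one second of audio
```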
Related papers
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions (see the sketch after this entry).
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
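As a rough, hypothetical illustration of this generative formulation (not DiffSED's actual model, which conditions a transformer denoiser on audio features), one can corrupt ground-truth (onset, offset) pairs with a DDPM-style noise schedule and train a small denoiser to recover them:

```python
import torch
import torch.nn as nn

T_STEPS = 100
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Hypothetical stand-in denoiser: (noisy onset, noisy offset, step) -> clean boundary.
denoiser = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))

def train_step(gt_boundaries):                     # (B, 2), onsets/offsets in [0, 1]
    t = torch.randint(0, T_STEPS, (gt_boundaries.size(0),))
    a_bar = alphas_bar[t].unsqueeze(1)
    noise = torch.randn_like(gt_boundaries)
    noisy = a_bar.sqrt() * gt_boundaries + (1 - a_bar).sqrt() * noise  # forward noising
    step = t.float().unsqueeze(1) / T_STEPS
    pred = denoiser(torch.cat([noisy, step], dim=1))
    return nn.functional.mse_loss(pred, gt_boundaries)  # x0-prediction objective

loss = train_step(torch.rand(8, 2).sort(dim=1).values)  # random valid (onset, offset) pairs
```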
- MomentDiff: Generative Video Moment Retrieval from Random to Real [71.40038773943638]
We provide a generative diffusion-based framework called MomentDiff.
MomentDiff simulates a typical human retrieval process from random browsing to gradual localization.
We show that MomentDiff consistently outperforms state-of-the-art methods on three public benchmarks.
arXiv Detail & Related papers (2023-07-06T09:12:13Z)
- A benchmark of state-of-the-art sound event detection systems evaluated on synthetic soundscapes [10.512055210540668]
We study the solutions proposed by participants to analyze their robustness to varying target-to-non-target signal-to-noise ratios and to the temporal localization of target sound events.
Results show that systems tend to spuriously predict short events when non-target events are present.
arXiv Detail & Related papers (2022-02-03T09:41:31Z)
- DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and Detection [16.18806719313959]
We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival.
We show that deep learning-based models trained on this new feature outperformed the DCASE challenge baseline by a large margin (a rough feature sketch follows this entry).
arXiv Detail & Related papers (2021-06-29T09:18:30Z)
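A rough sketch, under assumptions, of a SALSA-like feature: per-channel log-power spectrograms stacked with inter-channel phase differences, so the direction-of-arrival cue is aligned with the signal power at every time-frequency bin. The paper's exact spatial-cue normalization differs; the function name and parameters here are illustrative.

```python
import torch

def salsa_like(wav, n_fft=512, hop=160):
    """wav: (C, T) multichannel waveform -> (2C - 1, F, N) feature stack."""
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)  # (C, F, N)
    log_power = torch.log(spec.abs() ** 2 + 1e-8)
    # Phase of each channel relative to the first (reference) mic per T-F bin;
    # these inter-channel phase differences carry the direction-of-arrival cue.
    ipd = torch.angle(spec[1:] * spec[:1].conj())
    return torch.cat([log_power, ipd], dim=0)

features = salsa_like(torch.randn(4, 24000))   # e.g. a 4-mic, 1 s clip at 24 kHz
```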
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
- Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures [23.568610919253352]
This paper proposes a semi-supervised method for generating pseudo-labels from unsupervised data using a student-teacher scheme that balances self-training and cross-training.
The results of these methods on both the "validation" and "public evaluation" sets of the DESED database show significant improvement over state-of-the-art semi-supervised learning systems.
arXiv Detail & Related papers (2021-05-27T18:46:59Z)
- Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization [113.19483349876668]
This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model.
It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
arXiv Detail & Related papers (2021-02-28T07:52:20Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.