Exploiting Attention-based Sequence-to-Sequence Architectures for Sound
Event Localization
- URL: http://arxiv.org/abs/2103.00417v1
- Date: Sun, 28 Feb 2021 07:52:20 GMT
- Title: Exploiting Attention-based Sequence-to-Sequence Architectures for Sound
Event Localization
- Authors: Christopher Schymura, Tsubasa Ochiai, Marc Delcroix, Keisuke
Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa
- Abstract summary: This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model.
It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
- Score: 113.19483349876668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sound event localization frameworks based on deep neural networks have shown
increased robustness with respect to reverberation and noise in comparison to
classical parametric approaches. In particular, recurrent architectures that
incorporate temporal context into the estimation process seem to be well-suited
for this task. This paper proposes a novel approach to sound event localization
by utilizing an attention-based sequence-to-sequence model. These types of
models have been successfully applied to problems in natural language
processing and automatic speech recognition. In this work, a multi-channel
audio signal is encoded to a latent representation, which is subsequently
decoded to a sequence of estimated directions-of-arrival. Herein, the attention
mechanism captures temporal dependencies in the audio signal by focusing on the
frames that are most relevant for estimating the activity and
direction-of-arrival of sound events at the current time-step. The framework is
evaluated on three publicly available datasets for sound event localization. It
yields superior localization performance compared to state-of-the-art methods
in both anechoic and reverberant conditions.
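To make the described pipeline concrete, the following is a minimal sketch in PyTorch of an attention-based sequence-to-sequence direction-of-arrival estimator: an encoder maps per-frame features of the multi-channel signal to a latent sequence, and each decoder time-step attends over all encoded frames before predicting source activity and direction-of-arrival. All module choices, dimensions, and the azimuth/elevation output parameterization are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class Seq2SeqDOAEstimator(nn.Module):
    """Hypothetical attention-based sequence-to-sequence DOA estimator.

    The encoder consumes per-frame features extracted from a multi-channel
    audio signal; the decoder attends over all encoded frames to predict,
    for every time-step, a source-activity probability and a
    direction-of-arrival (azimuth/elevation).
    """

    def __init__(self, feat_dim=256, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)               # outputs 2*hidden_dim
        self.decoder = nn.GRU(2 * hidden_dim, 2 * hidden_dim, batch_first=True)
        self.attention = nn.MultiheadAttention(embed_dim=2 * hidden_dim,
                                               num_heads=4, batch_first=True)
        self.activity_head = nn.Linear(2 * hidden_dim, 1)        # source active?
        self.doa_head = nn.Linear(2 * hidden_dim, 2)             # azimuth, elevation

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) features of the multi-channel signal
        enc, _ = self.encoder(feats)                # latent representation
        dec, _ = self.decoder(enc)                  # decoder states
        # Every decoder time-step attends over all encoded frames, so the
        # prediction at step t can focus on whichever frames are relevant.
        context, _ = self.attention(dec, enc, enc)  # query=dec, key=value=enc
        activity = torch.sigmoid(self.activity_head(context))
        doa = self.doa_head(context)
        return activity, doa

# Example: a batch of 8 clips, 100 frames each, 256-dim features per frame.
model = Seq2SeqDOAEstimator()
activity, doa = model(torch.randn(8, 100, 256))
print(activity.shape, doa.shape)   # (8, 100, 1) and (8, 100, 2)
```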
Related papers
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
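The generative view summarized in the DiffSED entry above can be illustrated with a minimal denoising-diffusion training step, assuming a simplified setting: ground-truth event boundaries (onset/offset pairs) are corrupted with Gaussian noise at a randomly drawn diffusion step, and a small denoiser is trained to recover the clean boundaries. The network, noise schedule, and boundary encoding are assumptions for illustration, not the DiffSED design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical denoiser: maps noisy (onset, offset) proposals plus a diffusion
# step embedding back to clean temporal boundaries.
class BoundaryDenoiser(nn.Module):
    def __init__(self, num_steps=100, hidden=64):
        super().__init__()
        self.step_embed = nn.Embedding(num_steps, hidden)
        self.net = nn.Sequential(nn.Linear(2 + hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, noisy_boundaries, step):
        # noisy_boundaries: (batch, 2) normalized (onset, offset) in [0, 1]
        x = torch.cat([noisy_boundaries, self.step_embed(step)], dim=-1)
        return self.net(x)

num_steps = 100
betas = torch.linspace(1e-4, 0.02, num_steps)        # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

model = BoundaryDenoiser(num_steps)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(clean_boundaries):
    """One denoising-diffusion training step on ground-truth boundaries."""
    batch = clean_boundaries.shape[0]
    t = torch.randint(0, num_steps, (batch,))
    noise = torch.randn_like(clean_boundaries)
    a = alphas_cumprod[t].unsqueeze(-1)
    noisy = a.sqrt() * clean_boundaries + (1.0 - a).sqrt() * noise
    # The model learns to reverse the noising process by predicting the
    # clean boundaries from their noisy versions.
    pred = model(noisy, t)
    loss = F.mse_loss(pred, clean_boundaries)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: a batch of 16 ground-truth (onset, offset) pairs in [0, 1].
print(training_step(torch.rand(16, 2)))
```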
- DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic
Sound Event Localization and Detection [16.18806719313959]
We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival.
We show that the deep learning-based models trained on this new feature outperformed the DCASE challenge baseline by a large margin.
arXiv Detail & Related papers (2021-06-29T09:18:30Z)
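As a rough illustration of a spectrotemporally-aligned input feature, the sketch below stacks per-channel log-power spectrograms with inter-channel phase-difference maps computed on the same time-frequency grid. This is a generic construction assumed for illustration; the exact SALSA definition is given in the cited paper.

```python
import numpy as np
from scipy.signal import stft

def spectrotemporal_features(audio, fs=24000, n_fft=512, hop=240):
    """Stack log-power spectrograms with inter-channel phase differences.

    audio: (channels, samples) multi-channel waveform.
    Returns an array of shape (2 * channels - 1, freq_bins, frames): one
    log-power map per channel, plus phase-difference maps between the first
    (reference) channel and every other channel, all sharing the same
    time-frequency grid.
    """
    _, _, Z = stft(audio, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    log_power = np.log(np.abs(Z) ** 2 + 1e-8)          # (ch, freq, frames)
    # Inter-channel phase differences w.r.t. the reference channel carry the
    # spatial cue, aligned bin-by-bin with the signal power.
    phase_diff = np.angle(Z[1:] * np.conj(Z[:1]))       # (ch-1, freq, frames)
    return np.concatenate([log_power, phase_diff], axis=0)

# Example: 1 second of 4-channel audio at 24 kHz.
feats = spectrotemporal_features(np.random.randn(4, 24000))
print(feats.shape)
```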
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw
Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats each temporal sound event as a complete "sound-object" to be detected.
A dense sound proposal event map is then constructed to handle the challenge of predicting events with widely varying temporal durations.
arXiv Detail & Related papers (2021-06-13T11:43:41Z)
- PILOT: Introducing Transformers for Probabilistic Sound Event
Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
Pseudo-labels produced by the model in each iteration are then used to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
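The contrastive objective summarized above can be sketched as follows, under simplifying assumptions: audio and visual embeddings from the same video form positive pairs in an InfoNCE-style loss, and a pseudo-label mask from the previous iteration restricts visual pooling to regions believed to be sounding. The encoders and the exact pseudo-labeling rule are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(audio_emb, visual_map, pseudo_mask, temperature=0.07):
    """InfoNCE-style loss between audio clips and masked visual feature maps.

    audio_emb:   (batch, dim)        one embedding per audio clip
    visual_map:  (batch, dim, H, W)  spatial visual features per video frame
    pseudo_mask: (batch, 1, H, W)    soft mask of likely sounding regions,
                                     taken from the previous iteration's
                                     localization output (pseudo-labels)
    """
    # Pool visual features over the regions the pseudo-labels mark as sounding.
    weights = pseudo_mask / (pseudo_mask.sum(dim=(2, 3), keepdim=True) + 1e-8)
    visual_emb = (visual_map * weights).sum(dim=(2, 3))         # (batch, dim)

    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)

    # Audio and visual embeddings from the same video are positives;
    # all other pairings in the batch act as negatives.
    logits = audio_emb @ visual_emb.t() / temperature           # (batch, batch)
    targets = torch.arange(audio_emb.shape[0])
    return F.cross_entropy(logits, targets)

# Example with random tensors standing in for encoder outputs.
loss = masked_contrastive_loss(torch.randn(8, 128),
                               torch.randn(8, 128, 14, 14),
                               torch.rand(8, 1, 14, 14))
print(loss.item())
```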
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
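The region-wise fusion idea in the last entry can be sketched as follows, assuming each modality yields a per-frame posterior over candidate speaker positions and a learned weight per region and frame mixes the two streams log-linearly. The map resolution, weighting model, and combination rule are assumptions for illustration.

```python
import numpy as np

def fuse_with_dynamic_stream_weights(audio_post, video_post, stream_weights, eps=1e-12):
    """Fuse audio and video position posteriors with region-wise stream weights.

    audio_post, video_post: (frames, regions) posterior over candidate speaker
                            positions from each modality.
    stream_weights:         (frames, regions) in [0, 1]; the weight given to
                            the audio stream in each region at each time step
                            (1 - weight goes to the video stream).
    Returns a fused (frames, regions) posterior, renormalized per frame.
    """
    log_fused = (stream_weights * np.log(audio_post + eps)
                 + (1.0 - stream_weights) * np.log(video_post + eps))
    fused = np.exp(log_fused)
    return fused / fused.sum(axis=1, keepdims=True)

# Example: 100 frames, 36 azimuth sectors of 10 degrees each.
rng = np.random.default_rng(0)
audio = rng.dirichlet(np.ones(36), size=100)
video = rng.dirichlet(np.ones(36), size=100)
weights = rng.uniform(size=(100, 36))   # e.g. predicted from reliability cues
fused = fuse_with_dynamic_stream_weights(audio, video, weights)
print(fused.shape, fused[0].argmax())
```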
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.