PILOT: Introducing Transformers for Probabilistic Sound Event
Localization
- URL: http://arxiv.org/abs/2106.03903v1
- Date: Mon, 7 Jun 2021 18:29:19 GMT
- Title: PILOT: Introducing Transformers for Probabilistic Sound Event
Localization
- Authors: Christopher Schymura, Benedikt B\"onninghoff, Tsubasa Ochiai, Marc
Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa
- Abstract summary: This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
- Score: 107.78964411642401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sound event localization aims at estimating the positions of sound sources in
the environment with respect to an acoustic receiver (e.g. a microphone array).
Recent advances in this domain most prominently focused on utilizing deep
recurrent neural networks. Inspired by the success of transformer architectures
as a suitable alternative to classical recurrent neural networks, this paper
introduces a novel transformer-based sound event localization framework, where
temporal dependencies in the received multi-channel audio signals are captured
via self-attention mechanisms. Additionally, the estimated sound event
positions are represented as multivariate Gaussian variables, yielding an
additional notion of uncertainty, which many previously proposed deep
learning-based systems designed for this application do not provide. The
framework is evaluated on three publicly available multi-source sound event
localization datasets and compared against state-of-the-art methods in terms of
localization error and event detection accuracy. It outperforms all competing
systems on all datasets with statistical significant differences in
performance.
Related papers
- Sound event localization and classification using WASN in Outdoor Environment [2.234738672139924]
Methods for sound event localization and classification typically rely on a single microphone array.
We propose a deep learning-based method that employs multiple features and attention mechanisms to estimate the location and class of sound source.
arXiv Detail & Related papers (2024-03-29T11:44:14Z) - A unified multichannel far-field speech recognition system: combining
neural beamforming with attention based end-to-end model [14.795953417531907]
We propose a unified multichannel far-field speech recognition system that combines the neural beamforming and transformer-based Listen, Spell, Attend (LAS) speech recognition system.
The proposed method achieve 19.26% improvement when compared with a strong baseline.
arXiv Detail & Related papers (2024-01-05T07:11:13Z) - Acoustic-Net: A Novel Neural Network for Sound Localization and
Quantification [28.670240455952317]
A novel neural network, termed the Acoustic-Net, is proposed to locate and quantify the sound source simply using the original signals.
The experiments demonstrate that the proposed method significantly improves the accuracy of sound source prediction and the computing speed.
arXiv Detail & Related papers (2022-03-31T12:20:09Z) - SoundDet: Polyphonic Sound Event Detection and Localization from Raw
Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete sound-object" to be detected.
A dense sound proposal event map is then constructed to handle the challenges of predicting events with large varying temporal duration.
arXiv Detail & Related papers (2021-06-13T11:43:41Z) - End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z) - Exploiting Attention-based Sequence-to-Sequence Architectures for Sound
Event Localization [113.19483349876668]
This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model.
It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
arXiv Detail & Related papers (2021-02-28T07:52:20Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Esting the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z) - Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by
Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean error azimuth of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.