Sound Event Detection and Localization with Distance Estimation
- URL: http://arxiv.org/abs/2403.11827v2
- Date: Wed, 12 Jun 2024 13:54:11 GMT
- Title: Sound Event Detection and Localization with Distance Estimation
- Authors: Daniel Aleksander Krause, Archontis Politis, Annamaria Mesaros
- Abstract summary: SELD is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA); 3D SELD extends it with source distance estimation.
We study two ways of integrating distance estimation within the SELD core.
Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.
- Score: 4.139846693958608
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sound Event Detection and Localization (SELD) is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA). While this task has numerous applications and has been extensively researched in recent years, it fails to provide full information about the sound source position. In this paper, we overcome this problem by extending the task to Sound Event Detection and Localization with Distance Estimation (3D SELD). We study two ways of integrating distance estimation within the SELD core: a multi-task approach, in which the problem is tackled by a separate model output, and a single-task approach obtained by extending the multi-ACCDOA method to include distance information. We investigate both methods for the Ambisonic and binaural versions of STARSS23: Sony-TAU Realistic Spatial Soundscapes 2023. Moreover, our study involves experiments on the loss function related to the distance estimation part. Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.
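The two integration strategies named in the abstract can be illustrated with a minimal output-head sketch. Everything below (class and track counts, layer sizes, activation choices, the backbone feature width) is an assumption for illustration, not the authors' exact architecture.

```python
# Hedged sketch of the two ways to attach distance estimation to a SELD output stage.
# All sizes, activations, and the backbone feature dimension are illustrative assumptions.
import torch
import torch.nn as nn

N_CLASSES = 13   # number of sound event classes (assumption)
N_TRACKS = 3     # tracks per class, as in multi-ACCDOA (assumption)
D_FEAT = 256     # feature size of a shared SELD backbone (assumption)

class MultiTaskHead(nn.Module):
    """Multi-task variant: distance is handled by a separate model output."""
    def __init__(self):
        super().__init__()
        self.accdoa = nn.Linear(D_FEAT, 3 * N_TRACKS * N_CLASSES)  # activity-coupled DOA vectors
        self.distance = nn.Linear(D_FEAT, N_TRACKS * N_CLASSES)    # separate distance branch

    def forward(self, h):                    # h: (batch, time, D_FEAT)
        doa = torch.tanh(self.accdoa(h))     # DOA components in [-1, 1]
        dist = torch.relu(self.distance(h))  # non-negative distances
        return doa, dist

class ExtendedMultiACCDOAHead(nn.Module):
    """Single-task variant: each multi-ACCDOA vector is extended with a distance value."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(D_FEAT, 4 * N_TRACKS * N_CLASSES)  # 3 DOA + 1 distance per track/class

    def forward(self, h):                    # h: (batch, time, D_FEAT)
        y = self.out(h).reshape(*h.shape[:-1], N_TRACKS, N_CLASSES, 4)
        doa = torch.tanh(y[..., :3])         # activity-coupled DOA part
        dist = torch.relu(y[..., 3])         # appended distance part
        return doa, dist
```

In the multi-task case the distance branch can be trained with its own loss term (the abstract mentions experiments on this loss), while the extended multi-ACCDOA head folds direction and distance into a single target tensor.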
Related papers
- Leveraging Reverberation and Visual Depth Cues for Sound Event Localization and Detection with Distance Estimation [3.2472293599354596]
This report describes our systems submitted for the DCASE2024 Task 3 challenge: Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation (Track B).
Our main model is based on the audio-visual (AV) Conformer, which processes video and audio embeddings extracted with ResNet50 and with an audio encoder pre-trained on SELD, respectively.
This model outperformed the audio-visual baseline on the development set of the STARSS23 dataset by a wide margin, halving its DOAE and improving the F1 score by more than 3x.
arXiv Detail & Related papers (2024-10-29T17:28:43Z)
- SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation [21.82296230219289]
We propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model.
We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks.
We implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss.
arXiv Detail & Related papers (2024-08-09T13:26:08Z)
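A minimal sketch of the two-stage training schedule summarized in the SELD-Mamba entry above: stage one optimizes only SED and DoA losses, and stage two reintroduces the SDE (distance) term. The individual loss functions, weights, and stage boundary are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of a two-stage SELD loss schedule (SED + DoA first, SDE added later).
# Loss choices, weights, and the stage boundary are illustrative assumptions.
import torch.nn.functional as F

STAGE1_EPOCHS = 50  # epochs before the distance loss is reintroduced (assumption)

def seld_loss(pred, target, epoch, sde_weight=1.0):
    """pred/target: dicts holding 'sed', 'doa', and 'dist' tensors of matching shapes."""
    loss = F.binary_cross_entropy_with_logits(pred["sed"], target["sed"])   # SED term
    loss = loss + F.mse_loss(pred["doa"], target["doa"])                    # DoA term
    if epoch >= STAGE1_EPOCHS:                                              # stage 2 only
        loss = loss + sde_weight * F.l1_loss(pred["dist"], target["dist"])  # SDE term
    return loss
```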
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
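To make the multi-layer cross-attention idea in the MLCA-AVSR entry above concrete, here is a hedged sketch of one fusion block that could be inserted after each encoder layer; the model width, head count, and residual fusion rule are assumptions.

```python
# Hedged sketch of a cross-modal fusion block in the spirit of multi-layer
# cross-attention fusion: audio and visual streams attend to each other.
# Model width, head count, and residual fusion are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.audio_attends_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_attends_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio, video):  # each: (batch, time, d_model)
        a, _ = self.audio_attends_video(audio, video, video)  # audio queries, video keys/values
        v, _ = self.video_attends_audio(video, audio, audio)  # video queries, audio keys/values
        return audio + a, video + v                           # residual fusion at this layer
```

Applying such a block at several encoder depths, rather than fusing once at the end, is the multi-layer aspect the summary refers to.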
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
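The generative reformulation in the DiffSED entry above trains a model to denoise event boundaries; the sketch below shows only a standard DDPM-style forward (noising) step on normalized onset/offset pairs. The scaling and schedule handling are assumptions, not the paper's exact formulation.

```python
# Hedged sketch: DDPM-style forward noising of normalized event boundaries.
# A denoising model would be trained to reverse this step; details are assumptions.
import torch

def noise_boundaries(boundaries, t, alphas_cumprod):
    """boundaries: (batch, n_events, 2) onsets/offsets in [0, 1]; t: (batch,) timestep indices."""
    x0 = boundaries * 2.0 - 1.0                              # rescale to [-1, 1]
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1)                 # cumulative schedule values
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # q(x_t | x_0)
    return x_t, noise                                        # denoiser learns to recover x_0 / noise
```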
- Sound and Visual Representation Learning with Multiple Pretraining Tasks [104.11800812671953]
Self-supervised learning (SSL) tasks reveal different features of the data.
This work aims to combine multiple SSL tasks (Multi-SSL) so that the learned representation generalizes well across downstream tasks.
Experiments on sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models.
arXiv Detail & Related papers (2022-01-04T09:09:38Z)
- Joint Direction and Proximity Classification of Overlapping Sound Events from Binaural Audio [7.050270263489538]
We aim to investigate several ways of performing joint proximity and direction estimation from binaural recordings.
Considering the limitations of audio, we propose two methods of splitting the sphere into angular areas in order to obtain a set of directional classes.
We propose various ways of combining the proximity and direction estimation problems into a joint task providing temporal information about the onsets and offsets of appearing sources.
arXiv Detail & Related papers (2021-07-26T08:48:46Z)
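One simple way to realize the joint direction-and-proximity classification described in the entry above is to split the sphere into angular sectors and pair each sector with a near/far label; the sector sizes and distance threshold below are assumptions, not the paper's class definitions.

```python
# Hedged sketch: joint direction/proximity classes from an angular grid on the sphere.
# Sector sizes and the near/far threshold are illustrative assumptions.

AZI_STEP, ELE_STEP = 90, 45        # sector sizes in degrees (assumption)
NEAR_THRESHOLD_M = 1.5             # near/far split in metres (assumption)

def joint_class(azimuth_deg, elevation_deg, distance_m):
    """Map a source position to a joint (direction sector, proximity) class index."""
    azi_bin = int((azimuth_deg % 360) // AZI_STEP)         # 0..3 azimuth sectors
    ele_bin = int((elevation_deg + 90) // ELE_STEP)        # 0..3 elevation bands for [-90, 90)
    ele_bin = min(ele_bin, 180 // ELE_STEP - 1)            # clamp the +90 degree edge case
    n_sectors = (360 // AZI_STEP) * (180 // ELE_STEP)
    sector = azi_bin * (180 // ELE_STEP) + ele_bin
    proximity = 0 if distance_m < NEAR_THRESHOLD_M else 1  # 0 = near, 1 = far
    return proximity * n_sectors + sector                  # joint class index
```

For example, `joint_class(120, 10, 0.8)` falls into the near-field group (class index 6 with these assumed bin sizes).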
- What Makes Sound Event Localization and Detection Difficult? Insights from Error Analysis [15.088901748728391]
Sound event localization and detection (SELD) is an emerging research topic that aims to unify the tasks of sound event detection and direction-of-arrival estimation.
SELD inherits the challenges of both tasks, such as noise, reverberation, interference, polyphony, and non-stationarity of sound sources.
Previous studies have shown that unknown interferences in reverberant environments often cause major degradation in the performance of SELD systems.
arXiv Detail & Related papers (2021-07-22T06:01:49Z)
- DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and Detection [16.18806719313959]
We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival.
We show that the deep learning-based models trained on this new feature outperformed the DCASE challenge baseline by a large margin.
arXiv Detail & Related papers (2021-06-29T09:18:30Z)
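The SALSA feature in the entry above stacks log-spectrograms with per-bin spatial cues on the same time-frequency grid. Below is a heavily simplified, hedged sketch for first-order Ambisonics using a normalized pseudo-intensity cue; it is not the exact SALSA definition, and all parameters are assumptions.

```python
# Hedged sketch: a simplified spatial-cue-augmented log-spectrogram for first-order
# Ambisonics. Not the exact SALSA feature; the pseudo-intensity cue and all
# parameters here are illustrative assumptions.
import numpy as np
from scipy.signal import stft

def salsa_like_features(foa, sr, n_fft=512, hop=150):
    """foa: array of shape (4, n_samples) holding the W, X, Y, Z channels."""
    _, _, spec = stft(foa, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)   # (4, F, T)
    log_power = np.log(np.abs(spec) ** 2 + 1e-8)                         # per-channel log power
    w, xyz = spec[0], spec[1:]
    intensity = np.real(np.conj(w)[None, :, :] * xyz)                    # pseudo-intensity (3, F, T)
    norm = np.linalg.norm(intensity, axis=0, keepdims=True) + 1e-8
    return np.concatenate([log_power, intensity / norm], axis=0)         # (7, F, T) feature stack
```

Keeping the spatial cues on the same time-frequency grid as the spectrograms is what provides the exact time-frequency alignment between signal power and source direction that the summary highlights.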
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound event proposal map is then constructed to handle the challenge of predicting events with widely varying temporal durations.
arXiv Detail & Related papers (2021-06-13T11:43:41Z)
- PLUME: Efficient 3D Object Detection from Stereo Images [95.31278688164646]
Existing methods tackle the problem in two steps: first, depth estimation is performed and a pseudo-LiDAR point cloud representation is computed from the depth estimates; then object detection is performed in 3D space.
We propose a model that unifies these two tasks in the same metric space.
Our approach achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
arXiv Detail & Related papers (2021-01-17T05:11:38Z)
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [106.87299276189458]
Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
arXiv Detail & Related papers (2020-03-09T15:49:01Z)