An Experimental Study on Joint Modeling for Sound Event Localization and Detection with Source Distance Estimation
- URL: http://arxiv.org/abs/2501.10755v1
- Date: Sat, 18 Jan 2025 12:57:21 GMT
- Title: An Experimental Study on Joint Modeling for Sound Event Localization and Detection with Source Distance Estimation
- Authors: Yuxuan Dong, Qing Wang, Hengyi Hong, Ya Jiang, Shi Cheng,
- Abstract summary: The 3D SELD task addresses the lack of full spatial information in traditional SELD by integrating source distance estimation.
We propose three approaches to tackle this challenge, including a novel method with independent training and joint prediction.
Our proposed method ranked first in the DCASE 2024 Challenge Task 3, demonstrating the effectiveness of joint modeling.
- Score: 3.2637535969755858
- License:
- Abstract: Traditional sound event localization and detection (SELD) tasks typically focus on sound event detection (SED) and direction-of-arrival (DOA) estimation, but they fall short of providing full spatial information about the sound source. The 3D SELD task addresses this limitation by integrating source distance estimation (SDE), allowing for complete spatial localization. We propose three approaches to tackle this challenge: a novel method with independent training and joint prediction, which first treats DOA and distance estimation as separate tasks and then combines them to solve 3D SELD; a dual-branch representation that uses source Cartesian coordinates for simultaneous DOA and distance estimation; and a three-branch structure that jointly models SED, DOA, and SDE within a unified framework. Our proposed method ranked first in the DCASE 2024 Challenge Task 3, demonstrating the effectiveness of joint modeling for addressing the 3D SELD task. The relevant code for this paper will be open-sourced in the future.
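The abstract mentions a representation in which source Cartesian coordinates carry both DOA and distance. The following is a minimal sketch of that general idea, not the authors' implementation: a single vector whose direction encodes the DOA and whose length encodes the distance. The function names and the angle convention (azimuth measured counter-clockwise from the x-axis, elevation measured up from the horizontal plane, both in degrees) are assumptions made for illustration.

```python
import math


def spherical_to_cartesian(azimuth_deg, elevation_deg, distance_m):
    """Encode DOA and distance as one Cartesian vector (assumed convention)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return x, y, z


def cartesian_to_doa_and_distance(x, y, z):
    """Recover azimuth, elevation (degrees) and distance from the vector."""
    distance = math.sqrt(x * x + y * y + z * z)
    if distance == 0.0:
        # A zero-length vector carries no direction information.
        raise ValueError("a source at the origin has no defined direction")
    azimuth = math.degrees(math.atan2(y, x))
    # atan2 against the horizontal radius stays well-defined at the poles.
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation, distance


if __name__ == "__main__":
    vec = spherical_to_cartesian(30.0, 10.0, 2.5)
    print(vec)
    print(cartesian_to_doa_and_distance(*vec))
```

Under this encoding, a model that regresses the three coordinates predicts direction and distance at once: normalizing the vector gives the DOA, and its norm gives the source distance.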
Related papers
- SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation [21.82296230219289]
We propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model.
We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks.
We implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss.
arXiv Detail & Related papers (2024-08-09T13:26:08Z) - OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation [67.56268991234371]
OV-Uni3DETR achieves state-of-the-art performance in various scenarios, surpassing existing methods by more than 6% on average.
Code and pre-trained models will be released later.
arXiv Detail & Related papers (2024-03-28T17:05:04Z) - Sound Event Detection and Localization with Distance Estimation [4.139846693958608]
SELD is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA).
We study two ways of integrating distance estimation within the SELD core.
Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.
arXiv Detail & Related papers (2024-03-18T14:34:16Z) - Exploring Latent Cross-Channel Embedding for Accurate 3D Human Pose Reconstruction in a Diffusion Framework [6.669850111205944]
Monocular 3D human pose estimation poses significant challenges due to inherent depth ambiguities that arise during the reprojection process from 2D to 3D.
Recent advancements in diffusion models have shown promise in incorporating structural priors to address reprojection ambiguities.
We propose a novel cross-channel embedding framework that aims to fully explore the correlation between joint-level features of 3D coordinates and their 2D projections.
arXiv Detail & Related papers (2024-01-18T09:53:03Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation [44.940531391847]
We address the challenge of dense indoor prediction with sound in 2D and 3D via cross-modal knowledge distillation.
We are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations.
For audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-09-20T06:07:04Z) - Diffusion-based 3D Object Detection with Random Boxes [58.43022365393569]
Existing anchor-based 3D detection methods rely on empirical settings of anchors, which makes the algorithms lack elegance.
Our proposed Diff3Det migrates the diffusion model to proposal generation for 3D object detection by considering the detection boxes as generative targets.
In the inference stage, the model progressively refines a set of random boxes to the prediction results.
arXiv Detail & Related papers (2023-09-05T08:49:53Z) - Delving into Localization Errors for Monocular 3D Object Detection [85.77319416168362]
Estimating 3D bounding boxes from monocular images is an essential component in autonomous driving.
In this work, we quantify the impact introduced by each sub-task and find that localization error is the vital factor restricting monocular 3D detection.
arXiv Detail & Related papers (2021-03-30T10:38:01Z) - Ensemble and Random Collaborative Representation-Based Anomaly Detector for Hyperspectral Imagery [133.83048723991462]
We propose a novel ensemble and random collaborative representation-based detector (ERCRD) for hyperspectral anomaly detection (HAD).
Our experiments on four real hyperspectral datasets exhibit the accuracy and efficiency of this proposed ERCRD method compared with ten state-of-the-art HAD methods.
arXiv Detail & Related papers (2021-01-06T11:23:51Z) - Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in the 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z) - SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation [3.1542695050861544]
Estimating 3D orientation and translation of objects is essential for infrastructure-less autonomous navigation and driving.
We propose a novel 3D object detection method, named SMOKE, that combines a single keypoint estimate with regressed 3D variables.
Despite its structural simplicity, our proposed SMOKE network outperforms all existing monocular 3D detection methods on the KITTI dataset.
arXiv Detail & Related papers (2020-02-24T08:15:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.