ToS: A Team of Specialists ensemble framework for Stereo Sound Event Localization and Detection with distance estimation in Video
- URL: http://arxiv.org/abs/2601.17611v1
- Date: Sat, 24 Jan 2026 22:26:39 GMT
- Title: ToS: A Team of Specialists ensemble framework for Stereo Sound Event Localization and Detection with distance estimation in Video
- Authors: Davide Berghi, Philip J. B. Jackson
- Abstract summary: This multimodal task requires joint reasoning across semantic, spatial, and temporal dimensions. We introduce the Team of Specialists (ToS) ensemble framework, which integrates three complementary sub-networks. ToS has been benchmarked against state-of-the-art audio-visual models for 3D SELD on the DCASE2025 Task 3 Stereo SELD development set.
- Score: 5.010383717530127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sound event localization and detection with distance estimation (3D SELD) in video involves identifying active sound events at each time frame while estimating their spatial coordinates. This multimodal task requires joint reasoning across semantic, spatial, and temporal dimensions, a challenge that single models often struggle to address effectively. To tackle this, we introduce the Team of Specialists (ToS) ensemble framework, which integrates three complementary sub-networks: a spatio-linguistic model, a spatio-temporal model, and a tempo-linguistic model. Each sub-network specializes in a unique pair of dimensions, contributing distinct insights to the final prediction, akin to a collaborative team with diverse expertise. ToS has been benchmarked against state-of-the-art audio-visual models for 3D SELD on the DCASE2025 Task 3 Stereo SELD development set, consistently outperforming existing methods across key metrics. Future work will extend this proof of concept by strengthening the specialists with appropriate tasks, training, and pre-training curricula.
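The abstract describes an ensemble in which three specialist sub-networks each cover a pair of dimensions and their predictions are combined. The following is a minimal, hypothetical sketch of that fusion idea only: it assumes each specialist emits frame-wise, per-class predictions of [activity, DOA x, y, z, distance] and that the ensemble averages them. The tensor shapes, function names, and averaging rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

N_FRAMES, N_CLASSES = 4, 13  # DCASE-style setup: 13 target sound classes

def specialist_output(seed: int) -> np.ndarray:
    """Stand-in for one specialist sub-network: returns a
    (frames, classes, 5) tensor of [activity, x, y, z, distance]."""
    rng = np.random.default_rng(seed)
    out = rng.random((N_FRAMES, N_CLASSES, 5))
    # Normalize (x, y, z) into a unit direction-of-arrival vector.
    doa = out[..., 1:4]
    out[..., 1:4] = doa / np.linalg.norm(doa, axis=-1, keepdims=True)
    return out

def fuse(predictions: list, threshold: float = 0.5):
    """Average the specialists' outputs, then threshold activity
    scores to decide which (frame, class) pairs are active."""
    mean = np.mean(predictions, axis=0)
    active = mean[..., 0] > threshold
    # Averaged DOA vectors are no longer unit-length; re-normalize.
    doa = mean[..., 1:4]
    mean[..., 1:4] = doa / np.linalg.norm(doa, axis=-1, keepdims=True)
    return mean, active

# Three specialists (spatio-linguistic, spatio-temporal, tempo-linguistic)
preds = [specialist_output(s) for s in (0, 1, 2)]
fused, active = fuse(preds)
print(fused.shape, active.shape)
```

A simple mean is only one possible combination rule; the paper leaves the exact fusion scheme to the full text, so this sketch should be read as a schematic of the "team" structure rather than a reproduction of ToS.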
Related papers
- Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound [5.591620304505415]
This work presents the first formal framework for Audio-Visual World Models (AVWM). It formulates multimodal environment simulation as a partially observable decision process with audio-visual observations, fine-grained actions, and task rewards. We propose an Audio-Visual Conditional Transformer with a novel modality-expert architecture that balances visual and auditory learning.
arXiv Detail & Related papers (2025-11-30T13:11:56Z) - AudioScene: Integrating Object-Event Audio into 3D Scenes [19.66595321540055]
We present two novel audio-spatial scene datasets, AudioScanNet and AudioRoboTHOR. By integrating audio clips with spatially aligned 3D scenes, our datasets enable research on how audio signals interact with spatial context.
arXiv Detail & Related papers (2025-11-25T14:28:13Z) - Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos [5.010383717530127]
3D SELD is a complex task that combines temporal event classification with spatial localization. Traditional SELD approaches typically rely on multichannel input. We enhance a standard SELD architecture with semantic information by integrating pre-trained, contrastive language-aligned models.
arXiv Detail & Related papers (2025-09-08T12:07:32Z) - Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos [3.2472293599354596]
This report presents our systems submitted to the audio-only and audio-visual tracks of the DCASE2025 Task 3 Challenge: Stereo Sound Event Localization and Detection in Regular Video Content. SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. To address this, we enhance standard SELD architectures with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs.
arXiv Detail & Related papers (2025-07-07T10:08:57Z) - UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatio-temporal modeling. Our work demonstrates that a task-specific vision-text backbone can build a generalizable model for spatio-temporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z) - An Experimental Study on Joint Modeling for Sound Event Localization and Detection with Source Distance Estimation [3.2637535969755858]
The 3D SELD task addresses this limitation by integrating source distance estimation. We propose three approaches to tackle this challenge, including a novel method with independent training and joint prediction. Our proposed method ranked first in the DCASE 2024 Challenge Task 3, demonstrating the effectiveness of joint modeling.
arXiv Detail & Related papers (2025-01-18T12:57:21Z) - A Unified Framework for 3D Scene Understanding [50.6762892022386]
UniSeg3D is a unified 3D scene understanding framework. It achieves panoptic, semantic, instance, interactive, referring, and open-vocabulary segmentation tasks within a single model.
arXiv Detail & Related papers (2024-07-03T16:50:07Z) - Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z) - Multi-task Learning with 3D-Aware Regularization [55.97507478913053]
We propose a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space.
We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance.
arXiv Detail & Related papers (2023-10-02T08:49:56Z) - A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.