Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
- URL: http://arxiv.org/abs/2509.06598v1
- Date: Mon, 08 Sep 2025 12:07:32 GMT
- Title: Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
- Authors: Davide Berghi, Philip J. B. Jackson
- Abstract summary: 3D SELD is a complex task that combines temporal event classification with spatial localization. Traditional SELD approaches typically rely on multichannel input. We enhance a standard SELD architecture with semantic information by integrating pre-trained, contrastive language-aligned models.
- Score: 5.010383717530127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we address the multimodal task of stereo sound event localization and detection with source distance estimation (3D SELD) in regular video content. 3D SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last of these is arguably the most challenging to model. Traditional SELD approaches typically rely on multichannel input, limiting their capacity to benefit from large-scale pre-training due to data constraints. To overcome this, we enhance a standard SELD architecture with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. We perform an ablation study on the development set of the DCASE2025 Task 3 Stereo SELD Dataset to assess the individual contributions of the language-aligned models and benchmark against the DCASE Task 3 baseline systems. Additionally, we detail the curation process of the large synthetic audio and audio-visual datasets used for model pre-training. These datasets were further expanded through left-right channel-swapping augmentation. Our approach, combining extensive pre-training, model ensembling, and visual post-processing, ranked second in the DCASE 2025 Challenge Task 3 (Track B), underscoring the effectiveness of our method. Future work will explore the modality-specific contributions and architectural refinements.
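The abstract names two concrete mechanisms: the Cross-Modal Conformer, which fuses CLAP and OWL-ViT embeddings into a standard SELD encoder, and the left-right channel-swapping augmentation used to expand the pre-training data. Neither is specified in detail here, so the sketches below are illustrative assumptions rather than the authors' exact design. First, a minimal PyTorch block showing one plausible fusion-by-cross-attention reading of the Cross-Modal Conformer (the conformer's convolution module is omitted; the class name, dimensions, and layer order are all hypothetical):

```python
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    """Conformer-style block that attends to pre-trained semantic embeddings.

    `d_sem` stands in for the dimensionality of the CLAP / OWL-ViT
    embeddings; every size here is an illustrative assumption.
    """
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_sem: int = 512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_proj = nn.Linear(d_sem, d_model)  # project semantic tokens to model width
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # x:   (B, T, d_model) audio / audio-visual features over time
        # sem: (B, S, d_sem)   CLAP / OWL-ViT embedding tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]                # temporal self-attention
        s = self.sem_proj(sem)
        x = x + self.cross_attn(self.norm2(x), s, s)[0]   # fuse semantic context
        return x + self.ff(self.norm3(x))
```

Second, the channel-swapping augmentation is physically constrained: swapping the stereo channels mirrors the scene across the median plane, so azimuth labels must flip sign while elevation and distance stay fixed, and any video frames should be flipped horizontally to remain consistent. A minimal numpy sketch, assuming azimuths are stored in degrees and frames as (T, H, W, C) arrays (both assumptions about the dataset's label format):

```python
from typing import Optional
import numpy as np

def channel_swap(audio: np.ndarray, azimuth: np.ndarray,
                 video: Optional[np.ndarray] = None):
    """Left-right channel-swap augmentation for stereo SELD (sketch).

    audio:   (2, num_samples) stereo waveform
    azimuth: per-event azimuth labels in degrees (assumed format)
    video:   optional (T, H, W, C) frames, flipped for consistency
    """
    audio_sw = audio[::-1].copy()  # swap left and right channels
    azimuth_sw = -azimuth          # mirroring flips azimuth sign; elevation
                                   # and distance are unchanged
    video_sw = video[:, :, ::-1, :].copy() if video is not None else None
    return audio_sw, azimuth_sw, video_sw
```

Applied to every training clip, this augmentation doubles the effective dataset size at essentially no cost, which matches the abstract's description of expanding the synthetic datasets through left-right channel swapping.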
Related papers
- JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments [34.02990381039783]
We present JAEGER, a framework that extends AV-LLMs to 3D space to enable joint spatial grounding and reasoning. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation. Our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks.
arXiv Detail & Related papers (2026-02-20T04:06:07Z) - ToS: A Team of Specialists ensemble framework for Stereo Sound Event Localization and Detection with distance estimation in Video [5.010383717530127]
This multimodal task requires joint reasoning across semantic, spatial, and temporal dimensions. We introduce the Team of Specialists (ToS) ensemble framework, which integrates three complementary subnetworks. ToS has been benchmarked against state-of-the-art audio-visual models for 3D SELD on the DCASE2025 Task 3 Stereo SELD development set.
arXiv Detail & Related papers (2026-01-24T22:26:39Z) - Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information. Our method sets a new state of the art across the S4, MS3, and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z) - Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis [11.373305523732718]
Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems. Masked Autoencoders (MAE) have gained momentum, with growing adaptations in audio-visual contexts. AVF-MAE++ is a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA.
arXiv Detail & Related papers (2025-09-29T02:53:49Z) - Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos [3.2472293599354596]
This report presents our systems submitted to the audio-only and audio-visual tracks of the DCASE2025 Task 3 Challenge: Stereo Sound Event Localization and Detection in Regular Video Content. SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. To address this, we enhance standard SELD architectures with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs.
arXiv Detail & Related papers (2025-07-07T10:08:57Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - An Experimental Study on Joint Modeling for Sound Event Localization and Detection with Source Distance Estimation [3.2637535969755858]
The 3D SELD task addresses this limitation by integrating source distance estimation. We propose three approaches to tackle this challenge, including a novel method with independent training and joint prediction. Our proposed method ranked first in the DCASE 2024 Challenge Task 3, demonstrating the effectiveness of joint modeling.
arXiv Detail & Related papers (2025-01-18T12:57:21Z) - A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders [5.069884983892437]
We propose a new progressive pre-training method for image understanding tasks which leverages RGB-D datasets.
In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations.
In the second stage, we further pre-train the model using masked autoencoding and denoising/noise prediction.
Our approach is scalable, robust, and suitable for pre-training on RGB-D datasets.
arXiv Detail & Related papers (2024-08-05T05:33:59Z) - Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time [73.7845280328535]
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio.
Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking.
We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
arXiv Detail & Related papers (2024-07-01T23:32:25Z) - Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z) - ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights [61.36309876889977]
ViT-Lens enables efficient omni-modal representation learning by perceiving novel modalities with a pre-trained ViT and aligning them to a pre-defined space.
In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state of the art.
We will release the results of ViT-Lens on more modalities in the near future.
arXiv Detail & Related papers (2023-08-20T07:26:51Z)