Related papers: SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

URL: http://arxiv.org/abs/2408.05057v1
Date: Fri, 9 Aug 2024 13:26:08 GMT
Title: SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation
Authors: Da Mu, Zhicheng Zhang, Haobo Yue, Zehao Wang, Jin Tang, Jianqin Yin,
Abstract summary: We propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model. We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks. We implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss.
Score: 21.82296230219289
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the Sound Event Localization and Detection (SELD) task, Transformer-based models have demonstrated impressive capabilities. However, the quadratic complexity of the Transformer's self-attention mechanism results in computational inefficiencies. In this paper, we propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model. We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks to capture a broader range of contextual information while maintaining computational efficiency. Additionally, we implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss. Our experimental results on the 2024 DCASE Challenge Task3 dataset demonstrate the effectiveness of the selective state-space model in SELD and highlight the benefits of the two-stage training approach in enhancing SELD performance.

Related papers

Efficient Out-of-Scope Detection in Dialogue Systems via Uncertainty-Driven LLM Routing [6.579756339673344]
Out-of-scope (OOS) intent detection is a critical challenge in task-oriented dialogue systems (TODS)<n>We propose a novel but simple modular framework that combines uncertainty modeling with fine-tuned large language models (LLMs) for efficient and accurate OOS detection.
arXiv Detail & Related papers (2025-07-02T09:51:41Z)
Reliable Few-shot Learning under Dual Noises [166.53173694689693]
We propose DEnoised Task Adaptation (DETA++) for reliable few-shot learning.<n>DETA++ employs a memory bank to store and refine clean regions for each inner-task class, based on which a Local Nearestid (LocalNCC) is devised to yield noise-robust predictions on query samples.<n>Extensive experiments demonstrate the effectiveness and flexibility of DETA++.
arXiv Detail & Related papers (2025-06-19T14:05:57Z)
Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation.<n>This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks.<n> Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
An Experimental Study on Joint Modeling for Sound Event Localization and Detection with Source Distance Estimation [3.2637535969755858]
The 3D SELD task addresses the limitation by integrating source distance estimation. We propose three approaches to tackle this challenge: a novel method with independent training and joint prediction. Our proposed method ranked first in the DCASE 2024 Challenge Task 3, demonstrating the effectiveness of joint modeling.
arXiv Detail & Related papers (2025-01-18T12:57:21Z)
ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks. We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. Our purelytemporalal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
Intent Detection in the Age of LLMs [3.755082744150185]
Intent detection is a critical component of task-oriented dialogue systems (TODS) Traditional approaches relied on computationally efficient supervised sentence transformer encoder models. The emergence of generative large language models (LLMs) with intrinsic world knowledge presents new opportunities to address these challenges.
arXiv Detail & Related papers (2024-10-02T15:01:55Z)
MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking [50.26836546224782]
Event-based eye tracking has shown great promise with the high temporal resolution and low redundancy. The diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization. This paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information.
arXiv Detail & Related papers (2024-04-18T11:09:25Z)
Sound Event Detection and Localization with Distance Estimation [4.139846693958608]
3D SELD is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA) We study two ways of integrating distance estimation within the SELD core. Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.
arXiv Detail & Related papers (2024-03-18T14:34:16Z)
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning [109.9413329636322]
This paper introduces an efficient framework that integrates multiple modalities (images, 3D, audio and video) to a frozen Large Language Models (LLMs) Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs)
arXiv Detail & Related papers (2023-11-30T18:43:51Z)
DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process. During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
CoLoC: Conditioned Localizer and Classifier for Sound Event Localization and Detection [0.0]
We describe Conditioned Localizer and the (CoLoC) which is a novel solution for Sound Event localization and Detection (SELD) The solution constitutes of two stages: the localization is done first and is followed by classification conditioned by the output of the localizer. We show that our solution improves on the baseline system in most metrics on the STARSS22 dataset.
arXiv Detail & Related papers (2022-10-25T11:37:43Z)
Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection [12.915110466077866]
Sound event detection (SED) has gained increasing attention with its wide application in surveillance, video indexing, etc. Existing models in SED mainly generate frame-level predictions, converting it into a sequence multi-label classification problem. This paper firstly presents the 1D Detection Transformer (1D-DETR), inspired by Detection Transformer. Given the characteristics of SED, the audio query and a one-to-many matching strategy are added to 1D-DETR to form the model of Sound Event Detection Transformer (SEDT)
arXiv Detail & Related papers (2021-10-05T12:56:23Z)
Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations. Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
DANCE: DAta-Network Co-optimization for Efficient Segmentation Model Training and Inference [85.02494022662505]
DANCE is an automated simultaneous data-network co-optimization for efficient segmentation model training and inference. It integrates automated data slimming which adaptively downsamples/drops input images and controls their corresponding contribution to the training loss guided by the images' spatial complexity. Experiments and ablating studies demonstrate that DANCE can achieve "all-win" towards efficient segmentation.
arXiv Detail & Related papers (2021-07-16T04:58:58Z)
Event-LSTM: An Unsupervised and Asynchronous Learning-based Representation for Event-based Data [8.931153235278831]
Event cameras are activity-driven bio-inspired vision sensors. We propose Event-LSTM, an unsupervised Auto-Encoder architecture made up of LSTM layers. We also push state-of-the-art event de-noising forward by introducing memory into the de-noising process.
arXiv Detail & Related papers (2021-05-10T09:18:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.