SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation
- URL: http://arxiv.org/abs/2408.05057v1
- Date: Fri, 9 Aug 2024 13:26:08 GMT
- Title: SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation
- Authors: Da Mu, Zhicheng Zhang, Haobo Yue, Zehao Wang, Jin Tang, Jianqin Yin,
- Abstract summary: We propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model.
We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks.
We implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss.
- Score: 21.82296230219289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the Sound Event Localization and Detection (SELD) task, Transformer-based models have demonstrated impressive capabilities. However, the quadratic complexity of the Transformer's self-attention mechanism results in computational inefficiencies. In this paper, we propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model. We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks to capture a broader range of contextual information while maintaining computational efficiency. Additionally, we implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss. Our experimental results on the 2024 DCASE Challenge Task3 dataset demonstrate the effectiveness of the selective state-space model in SELD and highlight the benefits of the two-stage training approach in enhancing SELD performance.
Related papers
- ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our purelytemporalal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - Intent Detection in the Age of LLMs [3.755082744150185]
Intent detection is a critical component of task-oriented dialogue systems (TODS)
Traditional approaches relied on computationally efficient supervised sentence transformer encoder models.
The emergence of generative large language models (LLMs) with intrinsic world knowledge presents new opportunities to address these challenges.
arXiv Detail & Related papers (2024-10-02T15:01:55Z) - MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking [50.26836546224782]
Event-based eye tracking has shown great promise with the high temporal resolution and low redundancy.
The diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization.
This paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information.
arXiv Detail & Related papers (2024-04-18T11:09:25Z) - Sound Event Detection and Localization with Distance Estimation [4.139846693958608]
3D SELD is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA)
We study two ways of integrating distance estimation within the SELD core.
Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.
arXiv Detail & Related papers (2024-03-18T14:34:16Z) - X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning [109.9413329636322]
This paper introduces an efficient framework that integrates multiple modalities (images, 3D, audio and video) to a frozen Large Language Models (LLMs)
Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs)
arXiv Detail & Related papers (2023-11-30T18:43:51Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - CoLoC: Conditioned Localizer and Classifier for Sound Event Localization
and Detection [0.0]
We describe Conditioned Localizer and the (CoLoC) which is a novel solution for Sound Event localization and Detection (SELD)
The solution constitutes of two stages: the localization is done first and is followed by classification conditioned by the output of the localizer.
We show that our solution improves on the baseline system in most metrics on the STARSS22 dataset.
arXiv Detail & Related papers (2022-10-25T11:37:43Z) - Sound Event Detection Transformer: An Event-based End-to-End Model for
Sound Event Detection [12.915110466077866]
Sound event detection (SED) has gained increasing attention with its wide application in surveillance, video indexing, etc.
Existing models in SED mainly generate frame-level predictions, converting it into a sequence multi-label classification problem.
This paper firstly presents the 1D Detection Transformer (1D-DETR), inspired by Detection Transformer.
Given the characteristics of SED, the audio query and a one-to-many matching strategy are added to 1D-DETR to form the model of Sound Event Detection Transformer (SEDT)
arXiv Detail & Related papers (2021-10-05T12:56:23Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - DANCE: DAta-Network Co-optimization for Efficient Segmentation Model
Training and Inference [85.02494022662505]
DANCE is an automated simultaneous data-network co-optimization for efficient segmentation model training and inference.
It integrates automated data slimming which adaptively downsamples/drops input images and controls their corresponding contribution to the training loss guided by the images' spatial complexity.
Experiments and ablating studies demonstrate that DANCE can achieve "all-win" towards efficient segmentation.
arXiv Detail & Related papers (2021-07-16T04:58:58Z) - Event-LSTM: An Unsupervised and Asynchronous Learning-based
Representation for Event-based Data [8.931153235278831]
Event cameras are activity-driven bio-inspired vision sensors.
We propose Event-LSTM, an unsupervised Auto-Encoder architecture made up of LSTM layers.
We also push state-of-the-art event de-noising forward by introducing memory into the de-noising process.
arXiv Detail & Related papers (2021-05-10T09:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.