MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection
- URL: http://arxiv.org/abs/2406.08771v2
- Date: Sat, 15 Jun 2024 11:52:49 GMT
- Title: MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection
- Authors: Da Mu, Zhicheng Zhang, Haobo Yue
- Abstract summary: Event-Independent Network V2 (EINV2) has achieved outstanding performance on Sound Event Localization and Detection (SELD).
This paper proposes a three-stage network structure, the Multi-scale Feature Fusion (MFF) module, to fully extract multi-scale features across the spectral, spatial, and temporal domains.
- Score: 4.165508411354963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sound Event Localization and Detection (SELD) involves detecting and localizing sound events using multichannel sound recordings. The previously proposed Event-Independent Network V2 (EINV2) has achieved outstanding performance on SELD, but it still faces challenges in effectively extracting features across the spectral, spatial, and temporal domains. This paper proposes a three-stage network structure named the Multi-scale Feature Fusion (MFF) module to fully extract multi-scale features across the spectral, spatial, and temporal domains. The MFF module utilizes a parallel-subnetwork architecture to generate multi-scale spectral and spatial features, and employs the TF-Convolution Module to provide multi-scale temporal features. We incorporate the MFF module into EINV2 and term the proposed method MFF-EINV2. Experimental results on the 2022 and 2023 DCASE Challenge Task 3 datasets show the effectiveness of MFF-EINV2, which achieves state-of-the-art (SOTA) performance compared to published methods.
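The abstract describes two ingredients: parallel subnetworks operating at different spectral resolutions, and time-frequency convolutions for multi-scale temporal context. The paper's code is not reproduced here; the following is a minimal PyTorch sketch of that general idea, where the module names (`TFConvModule`, `MFFSketch`), scales, and shapes are assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- names, scales, and shapes are assumptions,
# not the authors' MFF-EINV2 implementation.
import torch
import torch.nn as nn


class TFConvModule(nn.Module):
    """Parallel time-wise and frequency-wise convolutions for temporal/spectral context."""
    def __init__(self, ch):
        super().__init__()
        self.time_conv = nn.Conv2d(ch, ch, kernel_size=(3, 1), padding=(1, 0))
        self.freq_conv = nn.Conv2d(ch, ch, kernel_size=(1, 3), padding=(0, 1))

    def forward(self, x):  # x: (batch, ch, time, freq)
        return self.time_conv(x) + self.freq_conv(x)


class MFFSketch(nn.Module):
    """Parallel subnetworks at different spectral resolutions, fused at the end."""
    def __init__(self, ch, scales=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AvgPool2d(kernel_size=(1, s)),  # downsample the frequency axis
                TFConvModule(ch),
                nn.Upsample(scale_factor=(1, s)),  # restore frequency resolution
            )
            for s in scales
        )
        self.fuse = nn.Conv2d(ch * len(scales), ch, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


feats = torch.randn(2, 64, 100, 64)    # (batch, channels, frames, mel bins)
print(MFFSketch(64)(feats).shape)      # torch.Size([2, 64, 100, 64])
```

Each branch pools along the frequency axis at a different scale before the time/frequency convolutions, then restores resolution so the branches can be concatenated and fused; this is only one plausible reading of the three-stage design.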
Related papers
- Accelerated Multi-Contrast MRI Reconstruction via Frequency and Spatial Mutual Learning [50.74383395813782]
We propose a novel Frequency and Spatial Mutual Learning Network (FSMNet) to explore global dependencies across different modalities.
The proposed FSMNet achieves state-of-the-art performance for the Multi-Contrast MR Reconstruction task with different acceleration factors.
arXiv Detail & Related papers (2024-09-21T12:02:47Z) - PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called the Point-Voxel Attention Fusion Network (PVAFN).
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization.
The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM).
arXiv Detail & Related papers (2024-08-05T08:35:59Z) - Multiple Contexts and Frequencies Aggregation Network for Deepfake Detection [5.65128683992597]
Deepfake detection faces increasing challenges due to the fast growth of generative models, which has produced massive and diverse Deepfake technologies.
Recent advances rely on introducing features from spatial or frequency domains rather than modeling general forgery features within backbones.
We propose an efficient network for face forgery detection named MkfaNet, which consists of two core modules.
arXiv Detail & Related papers (2024-08-03T05:34:53Z) - SFFNet: A Wavelet-Based Spatial and Frequency Domain Fusion Network for Remote Sensing Segmentation [9.22384870426709]
We propose the SFFNet (Spatial and Frequency Domain Fusion Network) framework.
The first stage uses spatial-domain methods to extract features with sufficient spatial detail and semantic information.
The second stage maps these features in both the spatial and frequency domains.
SFFNet achieves superior performance in terms of mIoU, reaching 84.80% and 87.73% on its two benchmark datasets, respectively.
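The summary names the two stages but not the frequency-domain operator; given the wavelet-based title, one plausible reading is a discrete wavelet transform that splits features into low- and high-frequency sub-bands. A hypothetical Haar-DWT sketch in PyTorch (not SFFNet's actual layers):

```python
# Hypothetical sketch of a wavelet-based frequency branch; not SFFNet's code.
import torch
import torch.nn.functional as F

def haar_dwt2(x):
    """One-level 2D Haar DWT: low-pass sub-band plus stacked detail sub-bands."""
    ll = F.avg_pool2d(x, 2)  # approximation (LL) band
    # fixed Haar filters for the LH / HL / HH detail bands
    k = x.new_tensor([[[[0.5, 0.5], [-0.5, -0.5]]],
                      [[[0.5, -0.5], [0.5, -0.5]]],
                      [[[0.5, -0.5], [-0.5, 0.5]]]])
    b, c, h, w = x.shape
    hi = F.conv2d(x.reshape(b * c, 1, h, w), k, stride=2)
    return ll, hi.reshape(b, c * 3, h // 2, w // 2)

feat = torch.randn(1, 16, 64, 64)
low, high = haar_dwt2(feat)
print(low.shape, high.shape)  # (1, 16, 32, 32) and (1, 48, 32, 32)
```

A spatial-domain branch and a fusion head would then combine these sub-bands with conventional convolutional features.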
arXiv Detail & Related papers (2024-05-03T10:47:56Z) - A Multi-Stage Adaptive Feature Fusion Neural Network for Multimodal Gait Recognition [15.080096318551346]
Most existing gait recognition algorithms are unimodal, and the few multimodal ones perform multimodal fusion only once.
We propose a multi-stage feature fusion strategy (MSFFS), which performs multimodal fusions at different stages in the feature extraction process.
Also, we propose an adaptive feature fusion module (AFFM) that considers the semantic association between silhouettes and skeletons.
arXiv Detail & Related papers (2023-12-22T03:25:15Z) - Frequency Spectrum is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector [42.079129968058275]
Multimodal content, such as mixing text with images, presents significant challenges to rumor detection in social media.
This work makes the first attempt at multimodal rumor detection in the frequency domain, which efficiently transforms spatial features into the frequency spectrum.
A novel Frequency Spectrum Representation and fUsion network (FSRU) with dual contrastive learning reveals that the frequency spectrum is more effective for multimodal representation and fusion.
arXiv Detail & Related papers (2023-12-18T08:55:42Z) - MBDF-Net: Multi-Branch Deep Fusion Network for 3D Object Detection [17.295359521427073]
We propose a Multi-Branch Deep Fusion Network (MBDF-Net) for 3D object detection.
In the first stage, our multi-branch feature extraction network utilizes Adaptive Attention Fusion modules to produce cross-modal fusion features from single-modal semantic features.
In the second stage, we use a region-of-interest (RoI)-pooled fusion module to generate enhanced local features for refinement.
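As a purely illustrative guess at what an adaptive, attention-style fusion of single-modal features might look like (the module name and gating design below are hypothetical; MBDF-Net's actual design is richer):

```python
# Illustrative attention-weighted fusion of two single-modal feature streams;
# names and design are assumptions, not MBDF-Net's implementation.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # predict a per-channel, per-point gate from both modalities
        self.gate = nn.Sequential(nn.Conv1d(2 * ch, ch, kernel_size=1), nn.Sigmoid())

    def forward(self, img_feat, pts_feat):  # both: (batch, ch, n_points)
        w = self.gate(torch.cat([img_feat, pts_feat], dim=1))
        return w * img_feat + (1.0 - w) * pts_feat  # adaptive cross-modal mix

fuse = AttentionFusion(128)
out = fuse(torch.randn(4, 128, 1024), torch.randn(4, 128, 1024))
print(out.shape)  # torch.Size([4, 128, 1024])
```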
arXiv Detail & Related papers (2021-08-29T15:40:15Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation, which is important for many applications such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately; the extracted features are then fused by effective residual-based fusion modules.
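As a hedged illustration of what a residual-based fusion module might look like (names, shapes, and layer choices below are assumptions, not the paper's code):

```python
# Hypothetical residual-based fusion block in the spirit of the PMF entry above;
# layer names and sizes are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    """Inject features from stream B into stream A as a learned residual."""
    def __init__(self, ch):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, a, b):  # a, b: (batch, ch, h, w)
        return a + self.mix(torch.cat([a, b], dim=1))  # residual fusion

cam = torch.randn(2, 64, 32, 32)             # camera-stream features
lidar = torch.randn(2, 64, 32, 32)           # lidar-stream features (projected)
print(ResidualFusion(64)(cam, lidar).shape)  # torch.Size([2, 64, 32, 32])
```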
arXiv Detail & Related papers (2021-06-21T10:47:26Z) - DONet: Dual-Octave Network for Fast MR Image Reconstruction [98.04121143761017]
The Dual-Octave Network (DONet) is capable of learning multi-scale spatial-frequency features from both the real and imaginary components of MR data.
Our framework provides three appealing benefits.
arXiv Detail & Related papers (2021-05-12T21:41:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.