Multimodal Fusion SLAM with Fourier Attention
- URL: http://arxiv.org/abs/2506.18204v2
- Date: Tue, 24 Jun 2025 09:24:14 GMT
- Title: Multimodal Fusion SLAM with Fourier Attention
- Authors: Youjie Zhou, Guofeng Mei, Yiming Wang, Yi Wan, Fabio Poiesi,
- Abstract summary: We propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes fast Fourier transform (FFT) to enhance algorithmic efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying lighting, and dark conditions.
- Score: 15.2253217769593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual SLAM is particularly challenging in environments affected by noise, varying lighting conditions, and darkness. Learning-based optical flow algorithms can leverage multiple modalities to address these challenges, but traditional optical flow-based visual SLAM approaches often require significant computational resources. To overcome this limitation, we propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes fast Fourier transform (FFT) to enhance algorithmic efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. We further enhance the interaction of multimodal features by incorporating multi-scale knowledge distillation across modalities. We also demonstrate the practical feasibility of FMF-SLAM in real-world scenarios with real-time performance by integrating it into a security robot, fusing it with a global positioning module (GNSS-RTK) and global Bundle Adjustment. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying lighting, and dark conditions. Our code and datasets are available at https://github.com/youjie-zhou/FMF-SLAM.git.
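The abstract does not spell out the Fourier attention itself. As a rough intuition only, FFT-based token mixers (in the style of FNet) replace the quadratic attention matrix with an O(n log n) transform over the token axis; the sketch below is illustrative, not the authors' implementation, and the cross-modal variant is an explicitly speculative guess at how RGB and depth spectra might interact.

```python
import numpy as np

def fourier_token_mix(x: np.ndarray) -> np.ndarray:
    """FNet-style mixing: 2D FFT over (tokens, channels), keeping the real part.

    x: (num_tokens, d_model) feature map from one modality (e.g. RGB or depth).
    Runs in O(n log n) versus the O(n^2) of dot-product self-attention.
    """
    return np.fft.fft2(x).real

def fourier_cross_mix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Speculative cross-modal variant (an assumption, not the paper's design):
    modulate one modality's spectrum by the conjugate of the other's,
    then return to the spatial domain."""
    spec = np.fft.fft(a, axis=0) * np.conj(np.fft.fft(b, axis=0))
    return np.fft.ifft(spec, axis=0).real

rgb = np.random.default_rng(0).standard_normal((64, 32))
depth = np.random.default_rng(1).standard_normal((64, 32))
mixed = fourier_token_mix(rgb)     # same (64, 32) shape as the input
fused = fourier_cross_mix(rgb, depth)
```

The appeal for SLAM is that the transform is global (every token mixes with every other) at a cost that stays tractable on embedded robot hardware.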
Related papers
- FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose Self-supervised Transfer (PST) and Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - Exploring State Space Model in Wavelet Domain: An Infrared and Visible Image Fusion Network via Wavelet Transform and State Space Model [8.392891463947661]
We propose Wavelet-Mamba, which integrates wavelet transform with the state-space model (SSM). The Wavelet-SSM module incorporates wavelet-based frequency-domain feature extraction and global information extraction through SSM. Our method achieves both visually compelling results and superior performance compared to current state-of-the-art methods.
arXiv Detail & Related papers (2025-03-24T06:25:44Z) - FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection [7.246630480680039]
Camouflaged Object Detection (COD) is challenging due to the strong similarity between camouflaged objects and their surroundings. Existing methods mainly rely on spatial local features, failing to capture global information. A Frequency-Assisted Mamba-Like Linear Attention Network (FMNet) is proposed to efficiently capture global features.
arXiv Detail & Related papers (2025-03-14T02:55:19Z) - Deep Fourier-embedded Network for RGB and Thermal Salient Object Detection [8.607385112274882]
Deep learning has significantly improved salient object detection (SOD) by combining both RGB and thermal (RGB-T) images. Existing deep learning-based RGB-T SOD models suffer from two major limitations. We propose a purely Fourier transform-based model, namely the Deep Fourier-Embedded Network (DFENet), for accurate RGB-T SOD.
arXiv Detail & Related papers (2024-11-27T14:55:16Z) - Wavelet-based Mamba with Fourier Adjustment for Low-light Image Enhancement [26.13172849144202]
We propose a novel Wavelet-based Mamba with Fourier Adjustment model called WalMaFa.
WMB is adopted in the Decoder and FFAB is adopted in the Latent-Decoder structure.
Experiments demonstrate that our proposed WalMaFa achieves state-of-the-art performance with fewer computational resources and faster speed.
arXiv Detail & Related papers (2024-10-27T02:48:28Z) - A Dual Domain Multi-exposure Image Fusion Network based on the Spatial-Frequency Integration [57.14745782076976]
Multi-exposure image fusion aims to generate a single high-dynamic image by integrating images with different exposures.
We propose a novel perspective on multi-exposure image fusion via the Spatial-Frequency Integration Framework, named MEF-SFI.
Our method achieves visually appealing fusion results compared with state-of-the-art multi-exposure image fusion approaches.
arXiv Detail & Related papers (2023-12-17T04:45:15Z) - Adaptive Frequency Filters As Efficient Global Token Mixers [100.27957692579892]
We show that adaptive frequency filters can serve as efficient global token mixers.
We take AFF token mixers as primary neural operators to build a lightweight neural network, dubbed AFFNet.
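Reading "adaptive frequency filters as global token mixers" literally, the mechanism can be sketched as: FFT along the token axis, pointwise multiplication by a learnable complex filter (equivalent to circular convolution over tokens), then inverse FFT. The filter shapes and initialization below are illustrative assumptions, not AFFNet's actual design.

```python
import numpy as np

def aff_token_mix(x: np.ndarray, freq_filter: np.ndarray) -> np.ndarray:
    """Global token mixing via an adaptive frequency filter.

    x: (num_tokens, d_model); freq_filter: complex, same shape (learnable
    in a real network). Pointwise multiplication in the frequency domain
    equals circular convolution over tokens, so every token influences
    every other at O(n log n) cost.
    """
    spec = np.fft.fft(x, axis=0)
    return np.fft.ifft(spec * freq_filter, axis=0).real

x = np.random.default_rng(0).standard_normal((16, 8))
identity = np.ones((16, 8), dtype=complex)  # all-pass filter
out = aff_token_mix(x, identity)  # all-pass filter recovers the input
```

Initializing the filter near all-pass (as here) is one plausible way to start training from an identity-like mixer.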
arXiv Detail & Related papers (2023-07-26T07:42:28Z) - Multi-modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion [59.19469551774703]
Infrared and visible image fusion aims to integrate comprehensive information from multiple sources to achieve superior performance on various practical tasks.
We propose a dynamic image fusion framework with a multi-modal gated mixture of local-to-global experts.
Our model consists of a Mixture of Local Experts (MoLE) and a Mixture of Global Experts (MoGE) guided by a multi-modal gate.
arXiv Detail & Related papers (2023-02-02T20:06:58Z) - CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network.
We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z) - Functional Regularization for Reinforcement Learning via Learned Fourier Features [98.90474131452588]
We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis.
We show that it improves the sample efficiency of both state-based and image-based RL.
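Embedding inputs into a Fourier basis is commonly written as mapping each state through sinusoids of projected coordinates; the projection matrix can be randomly initialized and then learned end-to-end. The dimensions and scale below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def fourier_features(x: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Map raw inputs to a Fourier basis: [cos(2*pi*B*x), sin(2*pi*B*x)].

    x: (batch, in_dim) states or flattened observations;
    B: (in_dim, num_features) projection matrix, learnable in practice.
    Output dimension is 2 * num_features, bounded in [-1, 1].
    """
    proj = 2.0 * np.pi * x @ B
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
states = rng.standard_normal((4, 3))
B = rng.normal(scale=1.0, size=(3, 16))
feats = fourier_features(states, B)  # shape (4, 32), values in [-1, 1]
```

The bounded, smooth embedding is what gives the regularization effect the entry describes: downstream value networks see a well-conditioned input regardless of raw state scale.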
arXiv Detail & Related papers (2021-12-06T18:59:52Z) - Beyond Self Attention: A Subquadratic Fourier Wavelet Transformer with Multi Modal Fusion [0.0]
We revisit the use of spectral techniques to replace the attention mechanism in Transformers. We present a comprehensive and novel reformulation of this technique in next-generation transformer models.
arXiv Detail & Related papers (2021-11-25T18:03:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.