FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba
- URL: http://arxiv.org/abs/2404.09498v3
- Date: Sun, 02 Feb 2025 13:57:05 GMT
- Title: FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba
- Authors: Xinyu Xie, Yawen Cui, Tao Tan, Xubin Zheng, Zitong Yu,
- Abstract summary: FusionMamba aims to overcome the challenges faced by CNNs and Vision Transformers (ViTs) in computer vision tasks. The framework improves the visual state-space model Mamba by integrating dynamic convolution and channel attention mechanisms. Experiments show that FusionMamba achieves state-of-the-art performance in a variety of multimodal image fusion tasks as well as downstream experiments.
- Score: 19.761723108363796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal image fusion aims to integrate information from different imaging techniques to produce a comprehensive, detail-rich single image for downstream vision tasks. Existing methods based on local convolutional neural networks (CNNs) struggle to capture global features efficiently, while Transformer-based models are computationally expensive, although they excel at global modeling. Mamba addresses these limitations by leveraging selective structured state space models (S4) to effectively handle long-range dependencies while maintaining linear complexity. In this paper, we propose FusionMamba, a novel dynamic feature enhancement framework that aims to overcome the challenges faced by CNNs and Vision Transformers (ViTs) in computer vision tasks. The framework improves the visual state-space model Mamba by integrating dynamic convolution and channel attention mechanisms, which not only retains its powerful global feature modeling capability, but also greatly reduces redundancy and enhances the expressiveness of local features. In addition, we have developed a new module called the dynamic feature fusion module (DFFM). It combines the dynamic feature enhancement module (DFEM) for texture enhancement and disparity perception with the cross-modal fusion Mamba module (CMFM), which focuses on enhancing the inter-modal correlation while suppressing redundant information. Experiments show that FusionMamba achieves state-of-the-art performance in a variety of multimodal image fusion tasks as well as downstream experiments, demonstrating its broad applicability and superiority.
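To ground the architecture described in the abstract, the sketch below shows one plausible arrangement of the named components: a dynamic feature enhancement module (DFEM) that pairs an input-conditioned convolution with channel attention, and a cross-modal fusion module (CMFM) that concatenates the two modality streams and mixes them with a global sequence mixer standing in for the Mamba state-space block. This is a minimal illustrative sketch, not the authors' implementation; the module names follow the abstract, but every layer choice, shape, and the `nn.Identity` placeholder for the Mamba block are assumptions.
```python
# Minimal structural sketch of the fusion pipeline described in the abstract.
# NOT the authors' code: module internals, shapes, and the placeholder for the
# Mamba state-space block are assumptions made for illustration only.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel re-weighting (assumed variant)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.mlp(self.pool(x))


class DFEM(nn.Module):
    """Dynamic feature enhancement: input-conditioned gating of a local conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())
        self.ca = ChannelAttention(channels)

    def forward(self, x):
        enhanced = self.local(x) * self.gate(x)  # "dynamic" conv: response modulated by the input itself
        return self.ca(x + enhanced)             # residual connection + channel attention


class CMFM(nn.Module):
    """Cross-modal fusion: concatenate streams, mix globally, project back."""

    def __init__(self, channels: int, global_mixer: nn.Module):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.mixer = global_mixer  # stand-in for a Mamba (selective SSM) block

    def forward(self, a, b):
        fused = self.proj(torch.cat([a, b], dim=1))     # (B, C, H, W)
        n, c, h, w = fused.shape
        seq = fused.flatten(2).transpose(1, 2)          # (B, H*W, C) token sequence
        seq = self.mixer(seq)                           # long-range mixing (placeholder)
        return seq.transpose(1, 2).reshape(n, c, h, w)


if __name__ == "__main__":
    channels = 16
    dfem = DFEM(channels)
    cmfm = CMFM(channels, global_mixer=nn.Identity())   # swap in a real Mamba block here
    ir, vis = torch.randn(1, channels, 64, 64), torch.randn(1, channels, 64, 64)
    print(cmfm(dfem(ir), dfem(vis)).shape)              # torch.Size([1, 16, 64, 64])
```
In a faithful implementation, the placeholder mixer would be replaced by a selective state-space (Mamba) block operating on the flattened token sequence, which is what keeps the long-range mixing linear in the number of tokens.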
Related papers
- An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas.
We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z) - vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition [0.0]
State-space models (SSMs) offer an alternative, but their application in vision remains underexplored.
This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness.
Tests on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.
arXiv Detail & Related papers (2025-03-27T08:39:58Z) - M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification [23.322598623627222]
M$^3$amba is a novel end-to-end CLIP-driven Mamba model for multi-modal fusion.
We introduce CLIP-driven modality-specific adapters to achieve a comprehensive semantic understanding of different modalities.
Experiments have shown that M$^3$amba has an average performance improvement of at least 5.98% compared with the state-of-the-art methods.
arXiv Detail & Related papers (2025-03-09T05:06:47Z) - DAMamba: Vision State Space Model with Dynamic Adaptive Scan [51.81060691414399]
State space models (SSMs) have recently garnered significant attention in computer vision.
We propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions.
Based on DAS, we propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks.
arXiv Detail & Related papers (2025-02-18T08:12:47Z) - MatIR: A Hybrid Mamba-Transformer Image Restoration Model [95.17418386046054]
We propose a Mamba-Transformer hybrid image restoration model called MatIR.
MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features.
In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths.
arXiv Detail & Related papers (2025-01-30T14:55:40Z) - MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z) - Fusion from Decomposition: A Self-Supervised Approach for Image Fusion and Beyond [74.96466744512992]
The essence of image fusion is to integrate complementary information from source images.
DeFusion++ produces versatile fused representations that can enhance the quality of image fusion and the effectiveness of downstream high-level vision tasks.
arXiv Detail & Related papers (2024-10-16T06:28:49Z) - DepMamba: Progressive Fusion Mamba for Multimodal Depression Detection [37.701518424351505]
Depression is a common mental disorder that affects millions of people worldwide.
We propose an audio-visual progressive fusion Mamba for multimodal depression detection, termed DepMamba.
arXiv Detail & Related papers (2024-09-24T09:58:07Z) - Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion [15.79138560700532]
We propose a dual-branch image fusion network called Tmamba.
It consists of a linear Transformer branch and a Mamba branch, giving it global modeling capability while maintaining linear complexity.
Experiments show that our Tmamba achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2024-09-05T03:42:11Z) - Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment [20.902935570581207]
We introduce a Multimodal Alignment and Reconstruction Network (MARNet) to enhance the model's resistance to visual noise.
MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains.
Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model.
arXiv Detail & Related papers (2024-07-26T16:30:18Z) - Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion [18.138433117711177]
We propose a novel multimodal hybrid tracker (MMHT) that utilizes frame-event-based data for reliable single object tracking.
The MMHT model employs a hybrid backbone consisting of an artificial neural network (ANN) and a spiking neural network (SNN) to extract dominant features from different visual modalities.
Extensive experiments demonstrate that the MMHT model exhibits competitive performance in comparison with other state-of-the-art methods.
arXiv Detail & Related papers (2024-05-28T07:24:56Z) - IRSRMamba: Infrared Image Super-Resolution via Mamba-based Wavelet Transform Feature Modulation Model [7.842507196763463]
IRSRMamba is a novel framework integrating wavelet transform feature modulation for multi-scale adaptation.
IRSRMamba outperforms state-of-the-art methods in PSNR, SSIM, and perceptual quality.
This work establishes Mamba-based architectures as a promising direction for high-fidelity IR image enhancement.
arXiv Detail & Related papers (2024-05-16T07:49:24Z) - Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Fusing complementary information across modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms the state-of-the-art methods in mAP by 5.9% on M$^3$FD and 4.9% on FLIR-Aligned datasets.
arXiv Detail & Related papers (2024-04-14T05:28:46Z) - MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion [4.2474907126377115]
Multi-modality image fusion (MMIF) aims to integrate complementary information from different modalities into a single fused image.
We propose a Mamba-based Dual-phase Fusion model (MambaDFuse) to extract modality-specific and modality-fused features.
Our approach achieves promising fusion results in infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2024-04-12T11:33:26Z) - FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model [35.57157248152558]
Current deep learning (DL) methods typically employ convolutional neural networks (CNNs) or Transformers for feature extraction and information integration.
We propose FusionMamba, an innovative method for efficient remote sensing image fusion.
arXiv Detail & Related papers (2024-04-11T17:29:56Z) - A Task-guided, Implicitly-searched and Meta-initialized Deep Model for Image Fusion [69.10255211811007]
We present a Task-guided, Implicitly-searched and Meta-initialized (TIM) deep model to address the image fusion problem in a challenging real-world scenario.
Specifically, we propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion.
Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency.
arXiv Detail & Related papers (2023-05-25T08:54:08Z) - Equivariant Multi-Modality Image Fusion [124.11300001864579]
We propose the Equivariant Multi-Modality imAge fusion paradigm for end-to-end self-supervised learning.
Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations.
Experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images.
arXiv Detail & Related papers (2023-05-19T05:50:24Z) - Multi-modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion [59.19469551774703]
Infrared and visible image fusion aims to integrate comprehensive information from multiple sources to achieve superior performances on various practical tasks.
We propose a dynamic image fusion framework with a multi-modal gated mixture of local-to-global experts.
Our model consists of a Mixture of Local Experts (MoLE) and a Mixture of Global Experts (MoGE) guided by a multi-modal gate.
arXiv Detail & Related papers (2023-02-02T20:06:58Z) - CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network.
We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.