A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion
- URL: http://arxiv.org/abs/2404.09293v2
- Date: Wed, 21 Aug 2024 17:07:02 GMT
- Title: A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion
- Authors: Zihan Cao, Xiao Wu, Liang-Jian Deng, Yu Zhong,
- Abstract summary: In image fusion tasks, images from different sources possess distinct characteristics.
Mamba, as a state space model, has emerged in the field of natural language processing.
Motivated by these challenges, we customize and improve the vision Mamba network designed for the image fusion task.
- Score: 14.293042131263924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In image fusion tasks, images from different sources possess distinct characteristics. This has driven the development of numerous methods to explore better ways of fusing them while preserving their respective characteristics. Mamba, as a state space model, has emerged in the field of natural language processing. Recently, many studies have attempted to extend Mamba to vision tasks. However, because images differ in nature from causal language sequences, the limited state capacity of Mamba weakens its ability to model image information. Additionally, the sequence modeling ability of Mamba captures only spatial information and cannot effectively capture the rich spectral information in images. Motivated by these challenges, we customize and improve the vision Mamba network designed for the image fusion task. Specifically, we propose the local-enhanced vision Mamba block, dubbed LEVM. The LEVM block can improve local information perception of the network and simultaneously learn local and global spatial information. Furthermore, we propose the state sharing technique to enhance spatial details and integrate spatial and spectral information. Finally, the overall network is a multi-scale structure based on vision Mamba, called LE-Mamba. Extensive experiments show the proposed methods achieve state-of-the-art results on multispectral pansharpening and multispectral and hyperspectral image fusion datasets, and demonstrate the effectiveness of the proposed approach. Codes can be accessed at https://github.com/294coder/Efficient-MIF.
Related papers
- LFSamba: Marry SAM with Mamba for Light Field Salient Object Detection [9.787855464038673]
A light field camera can reconstruct 3D scenes using captured multi-focus images that contain rich spatial geometric information.
In this work, a state-of-the-art salient object detection model for multi-focus light field images, called LFSamba, is introduced.
arXiv Detail & Related papers (2024-11-11T01:37:32Z)
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception.
We introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective.
Second, we introduce Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z)
- MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications.
Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features.
We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z)
- FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba [17.75933946414591]
Multi-modal image fusion aims to combine information from different modes to create a single image with detailed textures.
Transformer-based models, while excelling in global feature modeling, confront computational challenges stemming from their quadratic complexity.
We propose FusionMamba, a novel dynamic feature enhancement method for multimodal image fusion with Mamba.
arXiv Detail & Related papers (2024-04-15T06:37:21Z)
- Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Cross-modality fusion of information from different modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms the state-of-the-art methods in mAP by 5.9% on the M3FD dataset and 4.9% on the FLIR-Aligned dataset.
arXiv Detail & Related papers (2024-04-14T05:28:46Z)
- FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model [35.57157248152558]
Current deep learning (DL) methods typically employ convolutional neural networks (CNNs) or Transformers for feature extraction and information integration.
We propose FusionMamba, an innovative method for efficient remote sensing image fusion.
arXiv Detail & Related papers (2024-04-11T17:29:56Z)
- RSMamba: Remote Sensing Image Classification with State Space Model [25.32283897448209]
We introduce RSMamba, a novel architecture for remote sensing image classification.
RSMamba is based on the State Space Model (SSM) and incorporates an efficient, hardware-aware design known as Mamba.
We propose a dynamic multi-path activation mechanism to augment Mamba's capacity to model non-temporal image data.
arXiv Detail & Related papers (2024-03-28T17:59:49Z)
- SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images.
We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
- Multi-modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion [59.19469551774703]
Infrared and visible image fusion aims to integrate comprehensive information from multiple sources to achieve superior performances on various practical tasks.
We propose a dynamic image fusion framework with a multi-modal gated mixture of local-to-global experts.
Our model consists of a Mixture of Local Experts (MoLE) and a Mixture of Global Experts (MoGE) guided by a multi-modal gate.
arXiv Detail & Related papers (2023-02-02T20:06:58Z)