DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion
- URL: http://arxiv.org/abs/2601.05538v1
- Date: Fri, 09 Jan 2026 05:26:54 GMT
- Title: DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion
- Authors: Yiming Sun, Zifan Ye, Qinghua Hu, Pengfei Zhu,
- Abstract summary: Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. We propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our method outperforms existing approaches in both visual quality and quantitative evaluation.
- Score: 51.07069814578009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. Although existing approaches based on state space models have achieved satisfactory performance with high computational efficiency, they tend either to over-prioritize infrared intensity at the cost of visible details or, conversely, to preserve visible structure while diminishing thermal target salience. To overcome these challenges, we propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our approach leverages feature discrepancy maps between modalities to guide feature extraction, followed by a fusion process across both the channel and spatial dimensions. In the channel dimension, a channel-exchange module enhances channel-wise interaction through cross-attention dual state space modeling, enabling adaptive feature reweighting. In the spatial dimension, a spatial-exchange module employs cross-modal state space scanning to achieve comprehensive spatial fusion. By efficiently capturing global dependencies while maintaining linear computational complexity, DIFF-MF effectively integrates complementary multi-modal features. Experimental results on driving-scenario and low-altitude UAV datasets demonstrate that our method outperforms existing approaches in both visual quality and quantitative evaluation.
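The data flow the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the real DIFF-MF uses Mamba-style state space blocks for the channel- and spatial-exchange modules, and all function names below are hypothetical. Simple statistics-based gating stands in for the state space components to show how a difference map can guide cross-modal fusion.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def difference_guided_fusion(feat_ir, feat_vis):
    """Toy sketch of difference-driven channel/spatial fusion.

    feat_ir, feat_vis: (C, H, W) feature maps from the infrared and
    visible branches. The actual model replaces the gating below with
    state space (Mamba) blocks; this only illustrates the data flow.
    """
    # 1. Difference map: highlights regions where the two modalities
    #    disagree, i.e. where complementary information lives.
    diff = np.abs(feat_ir - feat_vis)            # (C, H, W)
    gate = sigmoid(diff - diff.mean())           # soft spatial emphasis

    # 2. "Channel exchange" stand-in: reweight each branch's channels
    #    by the other branch's global channel statistics.
    w_ir = sigmoid(feat_vis.mean(axis=(1, 2)))   # (C,)
    w_vis = sigmoid(feat_ir.mean(axis=(1, 2)))   # (C,)
    ch_ir = feat_ir * w_ir[:, None, None]
    ch_vis = feat_vis * w_vis[:, None, None]

    # 3. "Spatial exchange" stand-in: blend spatially with the
    #    difference gate, favoring the infrared branch where the
    #    modalities disagree most.
    return gate * ch_ir + (1.0 - gate) * ch_vis
```

The key design point the sketch preserves is that the discrepancy between modalities, rather than either modality alone, drives where each branch contributes to the fused output.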
Related papers
- Efficient Rectified Flow for Image Fusion [48.330480065862474]
We propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We also propose a task-specific variational autoencoder architecture tailored for image fusion. Our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality.
arXiv Detail & Related papers (2025-09-20T06:21:00Z) - PIF-Net: Ill-Posed Prior Guided Multispectral and Hyperspectral Image Fusion via Invertible Mamba and Fusion-Aware LoRA [0.16385815610837165]
The goal of multispectral and hyperspectral image fusion (MHIF) is to generate high-quality images that simultaneously possess rich spectral information and fine spatial details. Previous studies have not effectively addressed the ill-posed nature caused by data misalignment. We propose a fusion framework named PIF-Net, which explicitly incorporates ill-posed priors to effectively fuse multispectral and hyperspectral images.
arXiv Detail & Related papers (2025-08-01T09:17:17Z) - A Fusion-Guided Inception Network for Hyperspectral Image Super-Resolution [4.487807378174191]
We propose a single-image super-resolution model called the Fusion-Guided Inception Network (FGIN). Specifically, we first employ a spectral-spatial fusion module to effectively integrate spectral and spatial information. An Inception-like hierarchical feature extraction strategy is used to capture multiscale spatial dependencies. To further enhance reconstruction quality, we incorporate an optimized upsampling module that combines bilinear interpolation with depthwise separable convolutions.
arXiv Detail & Related papers (2025-05-06T11:15:59Z) - Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment [20.902935570581207]
We introduce a Multimodal Alignment and Reconstruction Network (MARNet) to enhance the model's resistance to visual noise.
MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains.
Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model.
arXiv Detail & Related papers (2024-07-26T16:30:18Z) - Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Fusing information across modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms state-of-the-art methods, improving mAP by 5.9% on the M3FD dataset and 4.9% on the FLIR-Aligned dataset.
arXiv Detail & Related papers (2024-04-14T05:28:46Z) - A Dual Domain Multi-exposure Image Fusion Network based on the Spatial-Frequency Integration [57.14745782076976]
Multi-exposure image fusion aims to generate a single high-dynamic image by integrating images with different exposures.
We propose a novel perspective on multi-exposure image fusion via the Spatial-Frequency Integration Framework, named MEF-SFI.
Our method achieves visually appealing fusion results compared with state-of-the-art multi-exposure image fusion approaches.
arXiv Detail & Related papers (2023-12-17T04:45:15Z) - Diffusion Models Without Attention [110.5623058129782]
Diffusion State Space Model (DiffuSSM) is an architecture that supplants attention mechanisms with a more scalable state space model backbone.
Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward.
arXiv Detail & Related papers (2023-11-30T05:15:35Z) - Multi-Spectral Image Stitching via Spatial Graph Reasoning [52.27796682972484]
We propose a spatial graph reasoning based multi-spectral image stitching method.
We embed multi-scale complementary features from the same view position into a set of nodes.
By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features.
arXiv Detail & Related papers (2023-07-31T15:04:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.