MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement
- URL: http://arxiv.org/abs/2602.01760v1
- Date: Mon, 02 Feb 2026 07:43:29 GMT
- Title: MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement
- Authors: Hao Zhang, Yanping Zha, Zizhuo Li, Meiqi Gong, Jiayi Ma
- Abstract summary: We propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. We develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image.
- Score: 38.48174002671134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.
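The abstract describes fusing the probabilistic noise from two diffusion branches and then recovering a scene representation through successive sampling, but it does not give the fusion rule. The sketch below is only an illustration under stated assumptions: the fusion is taken to be a fixed convex combination (`fuse_noise` and its `weight` parameter are hypothetical), the sampler is the standard DDPM ancestral update with a linear beta schedule, and the two noise predictors are toy stand-ins rather than trained networks. An image is represented as a flat list of pixel values to keep the example dependency-free.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; returns (betas, alphas, alpha_bars)."""
    if num_steps == 1:
        betas = [beta_start]
    else:
        step = (beta_end - beta_start) / (num_steps - 1)
        betas = [beta_start + i * step for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a  # cumulative product of alphas
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def fuse_noise(eps_visible, eps_infrared, weight=0.5):
    """ASSUMPTION: fuse two noise estimates by a convex combination.

    The paper's actual fusion branch is learned; this fixed weighting
    is a hypothetical placeholder for it.
    """
    return [weight * v + (1.0 - weight) * r
            for v, r in zip(eps_visible, eps_infrared)]


def ddpm_reverse_step(x_t, eps, t, schedule, rng):
    """One standard DDPM ancestral sampling step from x_t to x_{t-1}."""
    betas, alphas, alpha_bars = schedule
    coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - coef * e) / math.sqrt(alphas[t]) for x, e in zip(x_t, eps)]
    if t == 0:
        return mean  # no noise is added at the final step
    sigma = math.sqrt(betas[t])
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def sample_fused(eps_vis_fn, eps_ir_fn, num_pixels,
                 num_steps=50, weight=0.5, seed=0):
    """Run the reverse process, fusing both branches' noise at every step."""
    rng = random.Random(seed)
    schedule = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]  # start from noise
    for t in reversed(range(num_steps)):
        eps = fuse_noise(eps_vis_fn(x, t), eps_ir_fn(x, t), weight)
        x = ddpm_reverse_step(x, eps, t, schedule, rng)
    return x


# Toy stand-ins for the two branches' noise predictors (not trained models).
def toy_visible_eps(x, t):
    return [0.1 * v for v in x]


def toy_infrared_eps(x, t):
    return [0.0 for _ in x]


fused = sample_fused(toy_visible_eps, toy_infrared_eps,
                     num_pixels=4, num_steps=10)
```

In a real system the two predictors would be the trained intra-spectral and cross-spectral diffusion networks, and the fixed `weight` would presumably be replaced by whatever the learned fusion branch computes.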
Related papers
- MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion [10.160499805076755]
We introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations. It delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.
arXiv Detail & Related papers (2025-09-16T09:58:06Z)
- SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion [65.80051636480836]
This paper proposes a conditional diffusion model guided by the Segment Anything Model (SAM) to achieve high-fidelity and semantically-aware image fusion. The framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks as a condition to drive the diffusion model's coarse-to-fine denoising generation. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations.
arXiv Detail & Related papers (2025-08-07T10:58:52Z)
- Dual-modal Prior Semantic Guided Infrared and Visible Image Fusion for Intelligent Transportation System [22.331591533400402]
Infrared and visible image fusion (IVF) plays an important role in intelligent transportation systems (ITS).
We propose a novel prior semantic guided image fusion method based on the dual-modality strategy.
arXiv Detail & Related papers (2024-03-24T16:41:50Z)
- UMCFuse: A Unified Multiple Complex Scenes Infrared and Visible Image Fusion Framework [18.30261731071375]
We propose a unified framework for infrared and visible image fusion in complex scenes, termed UMCFuse. We classify the pixels of visible images by the degree of scattering in light transmission, allowing us to separate fine details from overall intensity.
arXiv Detail & Related papers (2024-02-03T09:27:33Z)
- A Dual Domain Multi-exposure Image Fusion Network based on the Spatial-Frequency Integration [57.14745782076976]
Multi-exposure image fusion aims to generate a single high-dynamic image by integrating images with different exposures.
We propose a novel perspective on multi-exposure image fusion via the Spatial-Frequency Integration Framework, named MEF-SFI.
Our method achieves visually appealing fusion results compared with state-of-the-art multi-exposure image fusion approaches.
arXiv Detail & Related papers (2023-12-17T04:45:15Z)
- SSPFusion: A Semantic Structure-Preserving Approach for Infrared and Visible Image Fusion [15.513687345562499]
We propose a semantic structure-preserving fusion approach for multi-modality image fusion. We show that our method outperforms nine state-of-the-art methods in terms of both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2023-09-26T08:13:32Z)
- CoCoNet: Coupled Contrastive Learning Network with Multi-level Feature Ensemble for Multi-modality Image Fusion [68.78897015832113]
We propose a coupled contrastive learning network, dubbed CoCoNet, to realize infrared and visible image fusion. Our method achieves state-of-the-art (SOTA) performance under both subjective and objective evaluation.
arXiv Detail & Related papers (2022-11-20T12:02:07Z)
- Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
arXiv Detail & Related papers (2022-10-19T11:50:14Z)
- Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-07-19T10:01:31Z)
- Unsupervised Image Fusion Method based on Feature Mutual Mapping [16.64607158983448]
We propose an unsupervised adaptive image fusion method to address the above issues.
We construct a global map to measure the connections of pixels between the input source images.
Our method achieves superior performance in both visual perception and objective evaluation.
arXiv Detail & Related papers (2022-01-25T07:50:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.