Related papers: Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision

Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision

URL: http://arxiv.org/abs/2509.11476v1
Date: Sun, 14 Sep 2025 23:44:15 GMT
Title: Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision
Authors: Tianyao Sun, Dawei Xiang, Tianqi Ding, Xiang Fang, Yijiashun Qi, Zunduo Zhao,
Abstract summary: We propose a novel end-to-end fusion framework that explicitly models inter-modality interaction and enhances task-critical regions.<n> FusionNet introduces a modality-aware attention mechanism that dynamically adjusts the contribution of infrared and visible features.<n>Experiments on the public M3FD dataset demonstrate that FusionNet generates fused images with enhanced semantic preservation, high perceptual quality, and clear interpretability.
Score: 8.898422193366354
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Infrared and visible image fusion (IVIF) is a fundamental task in multi-modal perception that aims to integrate complementary structural and textural cues from different spectral domains. In this paper, we propose FusionNet, a novel end-to-end fusion framework that explicitly models inter-modality interaction and enhances task-critical regions. FusionNet introduces a modality-aware attention mechanism that dynamically adjusts the contribution of infrared and visible features based on their discriminative capacity. To achieve fine-grained, interpretable fusion, we further incorporate a pixel-wise alpha blending module, which learns spatially-varying fusion weights in an adaptive and content-aware manner. Moreover, we formulate a target-aware loss that leverages weak ROI supervision to preserve semantic consistency in regions containing important objects (e.g., pedestrians, vehicles). Experiments on the public M3FD dataset demonstrate that FusionNet generates fused images with enhanced semantic preservation, high perceptual quality, and clear interpretability. Our framework provides a general and extensible solution for semantic-aware multi-modal image fusion, with benefits for downstream tasks such as object detection and scene understanding.

Related papers

CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion [51.060328159429154]
Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities.<n>We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts.<n> Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
arXiv Detail & Related papers (2026-01-12T13:36:48Z)
Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model.<n>Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z)
Deep Unfolding Multi-modal Image Fusion Network via Attribution Analysis [16.6226059346127]
We propose an Unfolding Analysis Fusion network'' (UAAFusion) to tailor fused images more effectively for semantic segmentation.<n>We use attribution analysis techniques to explore the contributions of semantic regions in the source images to task discrimination.<n>Our method constructs a model-driven unfolding network that uses optimization objectives derived from attribution analysis.
arXiv Detail & Related papers (2025-02-03T16:00:21Z)
DAF-Net: A Dual-Branch Feature Decomposition Fusion Network with Domain Adaptive for Infrared and Visible Image Fusion [21.64382683858586]
Infrared and visible image fusion aims to combine complementary information from both modalities to provide a more comprehensive scene understanding. We propose a dual-branch feature decomposition fusion network (DAF-Net) with Maximum domain adaptive. By incorporating MK-MMD, the DAF-Net effectively aligns the latent feature spaces of visible and infrared images, thereby improving the quality of the fused images.
arXiv Detail & Related papers (2024-09-18T02:14:08Z)
From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images. Our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
An Interactively Reinforced Paradigm for Joint Infrared-Visible Image Fusion and Saliency Object Detection [59.02821429555375]
This research focuses on the discovery and localization of hidden objects in the wild and serves unmanned systems. Through empirical analysis, infrared and visible image fusion (IVIF) enables hard-to-find objects apparent. multimodal salient object detection (SOD) accurately delineates the precise spatial location of objects within the picture.
arXiv Detail & Related papers (2023-05-17T06:48:35Z)
Multi-modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion [59.19469551774703]
Infrared and visible image fusion aims to integrate comprehensive information from multiple sources to achieve superior performances on various practical tasks. We propose a dynamic image fusion framework with a multi-modal gated mixture of local-to-global experts. Our model consists of a Mixture of Local Experts (MoLE) and a Mixture of Global Experts (MoGE) guided by a multi-modal gate.
arXiv Detail & Related papers (2023-02-02T20:06:58Z)
CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network. We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z)
CoCoNet: Coupled Contrastive Learning Network with Multi-level Feature Ensemble for Multi-modality Image Fusion [68.78897015832113]
We propose a coupled contrastive learning network, dubbed CoCoNet, to realize infrared and visible image fusion.<n>Our method achieves state-of-the-art (SOTA) performance under both subjective and objective evaluation.
arXiv Detail & Related papers (2022-11-20T12:02:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.