CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion
- URL: http://arxiv.org/abs/2601.08619v1
- Date: Mon, 12 Jan 2026 13:36:48 GMT
- Title: CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion
- Authors: Yiming Sun, Yuan Ruan, Qinghua Hu, Pengfei Zhu,
- Abstract summary: Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities. We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
- Score: 51.060328159429154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities, enhancing environmental awareness for intelligent unmanned systems. Existing methods either focus on pixel-level fusion while overlooking downstream task adaptability or implicitly learn rigid semantics through cascaded detection/segmentation models, unable to interactively address diverse semantic target perception needs. We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts. The model integrates a multi-modal feature extractor, a reference prompt encoder (RPE), and a prompt-semantic fusion module (PSFM). The RPE dynamically encodes task-specific semantic prompts by fine-tuning pre-trained segmentation models with input mask guidance, while the PSFM explicitly injects these semantics into fusion features. Through synergistic optimization of parallel segmentation and fusion branches, our method achieves mutual enhancement between task performance and fusion quality. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
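The abstract describes mask prompts steering how fusion weights are assigned per region. The authors' RPE/PSFM modules are learned networks, but the core idea of mask-guided fusion can be illustrated with a minimal, hypothetical sketch: a binary mask prompt shifts the per-pixel blend toward the infrared modality inside prompted target regions while the background keeps visible-light detail. The function name and weight values below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of mask-prompt guided fusion (not the CtrlFuse code).
# A binary mask prompt modulates per-pixel fusion weights: prompted regions
# favor the infrared modality (salient targets), the rest favors visible light.

def mask_guided_fuse(ir, vis, mask, target_ir_w=0.8, background_ir_w=0.3):
    """Blend two equally sized grayscale images (lists of rows, values in [0, 1]).

    mask[i][j] == 1 marks a prompted semantic target; elsewhere 0.
    """
    fused = []
    for ir_row, vis_row, m_row in zip(ir, vis, mask):
        row = []
        for p_ir, p_vis, m in zip(ir_row, vis_row, m_row):
            w = target_ir_w if m else background_ir_w  # prompt picks the weight
            row.append(w * p_ir + (1.0 - w) * p_vis)
        fused.append(row)
    return fused

ir   = [[1.0, 0.0], [0.0, 1.0]]   # infrared image (hot target at top-left)
vis  = [[0.0, 1.0], [1.0, 0.0]]   # visible image
mask = [[1, 0], [0, 0]]           # prompt: only the top-left pixel is a target
out = mask_guided_fuse(ir, vis, mask)
```

In the real framework these weights are replaced by features the RPE extracts from a fine-tuned segmentation backbone, and the PSFM injects them into the fusion stream; the sketch only shows how a mask prompt makes the fusion outcome controllable region by region.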
Related papers
- Self-supervised Multiplex Consensus Mamba for General Image Fusion [34.041756423040184]
We propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. A Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating. Cross-modal scanning within MCCM strengthens feature interactions across modalities. A Bi-level Self-supervised Contrastive Learning Loss (BSCL) preserves high-frequency information without increasing computational overhead.
arXiv Detail & Related papers (2025-12-24T03:57:21Z) - Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z) - FusionFM: All-in-One Multi-Modal Image Fusion with Flow Matching [42.22268167379098]
We formulate image fusion as a direct probabilistic transport from source modalities to the fused image distribution. We employ a task-aware selection function to select the most reliable pseudo-labels for each task. For multi-task scenarios, we integrate elastic weight consolidation and experience replay mechanisms to preserve cross-task performance.
arXiv Detail & Related papers (2025-11-17T02:56:48Z) - MAFS: Masked Autoencoder for Infrared-Visible Image Fusion and Semantic Segmentation [43.62940654606311]
We propose a unified network for image fusion and semantic segmentation. We devise a heterogeneous feature fusion strategy to enhance semantic-aware capabilities for image fusion. Within the framework, we design a novel multi-stage Transformer decoder to aggregate fine-grained multi-scale fused features efficiently.
arXiv Detail & Related papers (2025-09-15T11:55:55Z) - SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion [65.80051636480836]
This paper proposes a conditional diffusion model guided by the Segment Anything Model (SAM) to achieve high-fidelity and semantically-aware image fusion. The framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks as a condition to drive the diffusion model's coarse-to-fine denoising generation. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations.
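The two-stage process described above (preliminary fusion, then mask-conditioned refinement) can be caricatured with a toy sketch. This is not SGDFuse's diffusion model; the helpers, step count, and rate below are illustrative assumptions that only mimic the control flow: stage one averages the modalities, stage two iteratively pulls the result toward the infrared signal wherever the semantic mask is active.

```python
# Toy two-stage illustration (pure Python, not SGDFuse itself):
# stage 1 builds a preliminary fusion; stage 2 refines it only where
# a semantic mask is active, mimicking mask-conditioned generation.

def preliminary_fusion(ir, vis):
    """Stage 1: naive average of the two 1-D signals."""
    return [(a + b) / 2.0 for a, b in zip(ir, vis)]

def conditional_refine(fused, ir, mask, steps=4, rate=0.5):
    """Stage 2: iterative refinement, gated per-element by the mask."""
    x = list(fused)
    for _ in range(steps):
        x = [xi + rate * m * (t - xi) for xi, t, m in zip(x, ir, mask)]
    return x

ir, vis, mask = [1.0, 1.0], [0.0, 0.0], [1, 0]
stage1 = preliminary_fusion(ir, vis)           # both elements start at 0.5
stage2 = conditional_refine(stage1, ir, mask)  # only the masked element moves
```

After refinement the masked element has converged close to the infrared value while the unmasked element is untouched, which is the essence of using a mask as a generation condition.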
arXiv Detail & Related papers (2025-08-07T10:58:52Z) - Multi-interactive Feature Learning and a Full-time Multi-modality
Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z) - A Task-guided, Implicitly-searched and Meta-initialized Deep Model for
Image Fusion [69.10255211811007]
We present a Task-guided, Implicitly-searched and Meta-initialized (TIM) deep model to address the image fusion problem in a challenging real-world scenario.
Specifically, we propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion.
Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency.
arXiv Detail & Related papers (2023-05-25T08:54:08Z) - Equivariant Multi-Modality Image Fusion [124.11300001864579]
We propose the Equivariant Multi-Modality imAge fusion paradigm for end-to-end self-supervised learning.
Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations.
Experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images.
arXiv Detail & Related papers (2023-05-19T05:50:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.