MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics
- URL: http://arxiv.org/abs/2511.12525v1
- Date: Sun, 16 Nov 2025 09:43:12 GMT
- Title: MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics
- Authors: Jing Li, Yifan Wang, Jiafeng Yan, Renlong Zhang, Bin Yang
- Abstract summary: Infrared and visible image fusion aims to integrate complementary multi-modal information into a single fused result. We propose a one-stop degradation-aware image fusion framework for multi-degradation scenarios driven by a large language model (MdaIF). To adaptively extract diverse weather-aware degradation knowledge and scene feature representations, we employ a pre-trained vision-language model (VLM) in our framework.
- Score: 8.783211177601045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Infrared and visible image fusion aims to integrate complementary multi-modal information into a single fused result. However, existing methods 1) fail to account for the degradation of visible images under adverse weather conditions, thereby compromising fusion performance; and 2) rely on fixed network architectures, limiting their adaptability to diverse degradation scenarios. To address these issues, we propose a one-stop degradation-aware image fusion framework for multi-degradation scenarios driven by a large language model (MdaIF). Given the distinct scattering characteristics of different degradation scenarios (e.g., haze, rain, and snow) in atmospheric transmission, a mixture-of-experts (MoE) system is introduced to tackle image fusion across multiple degradation scenarios. To adaptively extract diverse weather-aware degradation knowledge and scene feature representations, collectively referred to as the semantic prior, we employ a pre-trained vision-language model (VLM) in our framework. Guided by the semantic prior, we propose a degradation-aware channel attention module (DCAM), which employs degradation prototype decomposition to facilitate multi-modal feature interaction in the channel domain. In addition, to achieve effective expert routing, the semantic prior and channel-domain modulated features are utilized to guide the MoE, enabling robust image fusion in complex degradation scenarios. Extensive experiments validate the effectiveness of our MdaIF, demonstrating superior performance over SOTA methods.
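The abstract describes DCAM only at a high level. One plausible reading is that the VLM-derived semantic prior is decomposed over a small bank of learnable degradation prototypes, and the resulting mixture gates the fused features channel-wise. The PyTorch sketch below illustrates that reading; the class name, dimensions, and the soft-assignment decomposition are all assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DegradationChannelAttention(nn.Module):
    """Hypothetical DCAM-style gate: decompose a VLM prior over prototypes."""
    def __init__(self, channels: int, prior_dim: int, num_prototypes: int = 3):
        super().__init__()
        # One learnable prototype per degradation type (e.g. haze, rain, snow).
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, prior_dim))
        self.to_gate = nn.Linear(prior_dim, channels)

    def forward(self, feat: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # Soft-assign the semantic prior to the prototype bank (one way to
        # read "degradation prototype decomposition").
        weights = F.softmax(prior @ self.prototypes.t(), dim=-1)  # (B, P)
        mixed = weights @ self.prototypes                         # (B, D)
        gate = torch.sigmoid(self.to_gate(mixed))                 # (B, C)
        # Modulate the multi-modal features in the channel domain.
        return feat * gate[:, :, None, None]

feat = torch.randn(2, 64, 32, 32)   # toy fused IR/visible feature map
prior = torch.randn(2, 512)         # stand-in for a CLIP-style VLM embedding
print(DegradationChannelAttention(64, 512)(feat, prior).shape)  # (2, 64, 32, 32)
```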
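The abstract also states that the semantic prior and the channel-modulated features jointly guide expert routing. A plain soft-routed MoE consistent with that description might look like the following sketch; the number of experts, the convolutional expert design, and the pooling-based router input are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticRoutedMoE(nn.Module):
    """Hypothetical MoE whose router sees the prior plus modulated features."""
    def __init__(self, channels: int, prior_dim: int, num_experts: int = 3):
        super().__init__()
        # One expert per degradation regime (haze/rain/snow is one plausible split).
        self.experts = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_experts)]
        )
        self.router = nn.Linear(prior_dim + channels, num_experts)

    def forward(self, feat: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # Route on the semantic prior plus pooled channel-modulated features.
        pooled = feat.mean(dim=(2, 3))                                       # (B, C)
        weights = F.softmax(self.router(torch.cat([prior, pooled], -1)), -1) # (B, E)
        out = torch.stack([e(feat) for e in self.experts], dim=1)  # (B, E, C, H, W)
        return (weights[:, :, None, None, None] * out).sum(dim=1)

feat = torch.randn(2, 64, 32, 32)
prior = torch.randn(2, 512)
print(SemanticRoutedMoE(64, 512)(feat, prior).shape)  # torch.Size([2, 64, 32, 32])
```

Soft routing keeps every expert active and differentiable; a top-k router would be an equally plausible reading of "expert routing" here.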
Related papers
- Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z)
- Coupled Degradation Modeling and Fusion: A VLM-Guided Degradation-Coupled Network for Degradation-Aware Infrared and Visible Image Fusion [9.915632806109555]
We propose a novel VLM-Guided Degradation-Coupled Fusion network (VGDCFusion). Our VGDCFusion significantly outperforms existing state-of-the-art fusion approaches under various degraded image scenarios.
arXiv Detail & Related papers (2025-10-13T14:26:33Z)
- Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation [63.50827603618498]
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing.
arXiv Detail & Related papers (2025-09-23T17:05:46Z)
- SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion [65.80051636480836]
This paper proposes a conditional diffusion model guided by the Segment Anything Model (SAM) to achieve high-fidelity and semantically-aware image fusion. The framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks as a condition to drive the diffusion model's coarse-to-fine denoising generation. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations.
arXiv Detail & Related papers (2025-08-07T10:58:52Z)
- ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts [82.52042409680267]
Current image fusion methods struggle to address the composite degradations encountered in real-world imaging scenarios. We propose a controllable image fusion framework with language-vision prompts, termed ControlFusion. In experiments, ControlFusion outperforms SOTA fusion methods in fusion quality and degradation handling.
arXiv Detail & Related papers (2025-03-30T08:18:53Z)
- MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance [11.023241681116295]
We propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts.
arXiv Detail & Related papers (2025-03-19T07:20:02Z)
- Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion [13.949209965987308]
All-in-One Degradation-Aware Fusion Models (ADFMs) address complex scenes by mitigating degradations from source images and generating high-quality fused images. Mainstream ADFMs often rely on highly synthetic multi-modal multi-quality images for supervision, limiting their effectiveness in cross-modal and rare degradation scenarios. We present LURE, a degradation-aware Learning-driven Unified Representation model for infrared and visible image fusion.
arXiv Detail & Related papers (2025-03-10T08:16:36Z)
- Mixed Degradation Image Restoration via Local Dynamic Optimization and Conditional Embedding [67.57487747508179]
Multiple-in-one image restoration (IR) has made significant progress, aiming to handle all types of single degraded image restoration with a single model.
In this paper, we propose a novel multiple-in-one IR model that can effectively restore images with both single and mixed degradations.
arXiv Detail & Related papers (2024-11-25T09:26:34Z)
- All-weather Multi-Modality Image Fusion: Unified Framework and 100k Benchmark [42.49073228252726]
Multi-modality image fusion (MMIF) combines complementary information from different image modalities to provide a more comprehensive and objective interpretation of scenes.
Existing MMIF methods lack the ability to resist different weather interferences in real-world scenes, preventing them from being useful in practical applications such as autonomous driving.
We propose an all-weather MMIF model to achieve effective multi-tasking in this context.
Experimental results in both real-world and synthetic scenes show that the proposed algorithm excels in detail recovery and multi-modality feature extraction.
arXiv Detail & Related papers (2024-02-03T09:02:46Z)
- Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-free Multi-Exposure Image Fusion [60.221404321514086]
Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels.
This paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for automatic design of both network structures and loss functions.
arXiv Detail & Related papers (2023-09-03T08:07:26Z)
- Searching a Compact Architecture for Robust Multi-Exposure Image Fusion [55.37210629454589]
Two major stumbling blocks hinder development: pixel misalignment and inefficient inference.
This study introduces an architecture search-based paradigm incorporating self-alignment and detail repletion modules for robust multi-exposure image fusion.
The proposed method outperforms various competitive schemes, achieving a noteworthy 3.19% improvement in PSNR for general scenarios and an impressive 23.5% enhancement in misaligned scenarios.
arXiv Detail & Related papers (2023-05-20T17:01:52Z)