MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance
- URL: http://arxiv.org/abs/2503.14944v1
- Date: Wed, 19 Mar 2025 07:20:02 GMT
- Title: MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance
- Authors: Zihan Cao, Yu Zhong, Ziqi Wang, Liang-Jian Deng
- Abstract summary: We propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts.
- Score: 11.023241681116295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Code is available at https://github.com/294coder/MMAIF.
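As a rough illustration of this framework, below is a minimal PyTorch sketch of one flow-matching training step in latent space, conditioned on degraded inputs and a prompt embedding. The module names (`LatentEncoder`, `FusionDiT`), shapes, and the linear probability path are assumptions for exposition, not the released MMAIF code:

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Stand-in for a frozen VAE encoder mapping images to latents."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        self.net = nn.Conv2d(in_ch, latent_ch, kernel_size=8, stride=8)
    def forward(self, x):
        return self.net(x)

class FusionDiT(nn.Module):
    """Toy stand-in for the all-in-one Diffusion Transformer backbone."""
    def __init__(self, latent_ch=4, cond_ch=8, txt_dim=64):
        super().__init__()
        self.proj_t = nn.Linear(1, latent_ch)           # flow-time embedding
        self.proj_txt = nn.Linear(txt_dim, latent_ch)   # prompt embedding
        self.net = nn.Conv2d(latent_ch + cond_ch, latent_ch, 3, padding=1)
    def forward(self, z_t, t, z_cond, txt_emb):
        h = self.net(torch.cat([z_t, z_cond], dim=1))
        h = h + self.proj_t(t.view(-1, 1))[:, :, None, None]
        return h + self.proj_txt(txt_emb)[:, :, None, None]  # velocity estimate

enc, dit = LatentEncoder(), FusionDiT()
ir, vis, fused_gt = (torch.randn(2, 3, 256, 256) for _ in range(3))
txt_emb = torch.randn(2, 64)                    # prompt embedding (e.g. from a text encoder)

z1 = enc(fused_gt)                              # clean target latent
z_cond = torch.cat([enc(ir), enc(vis)], dim=1)  # condition on degraded inputs
z0 = torch.randn_like(z1)                       # noise endpoint of the flow
t = torch.rand(z1.size(0), 1, 1, 1)             # flow time in [0, 1]
z_t = (1 - t) * z0 + t * z1                     # linear interpolation path
v_target = z1 - z0                              # constant velocity target
loss = ((dit(z_t, t, z_cond, txt_emb) - v_target) ** 2).mean()
loss.backward()
```

The Regression-based variant mentioned in the abstract would presumably predict the clean latent directly from the same conditions rather than a velocity field.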
Related papers
- Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion [13.949209965987308]
All-in-One Degradation-Aware Fusion Models (ADFMs) address complex scenes by mitigating degradations from source images and generating high-quality fused images. Mainstream ADFMs often rely on highly synthetic multi-modal multi-quality images for supervision, limiting their effectiveness in cross-modal and rare degradation scenarios. We present LURE, a degradation-aware Learning-driven Unified Representation model for infrared and visible image fusion.
arXiv Detail & Related papers (2025-03-10T08:16:36Z)
- Fusion from Decomposition: A Self-Supervised Approach for Image Fusion and Beyond [74.96466744512992]
The essence of image fusion is to integrate complementary information from source images.
DeFusion++ produces versatile fused representations that can enhance the quality of image fusion and the effectiveness of downstream high-level vision tasks.
arXiv Detail & Related papers (2024-10-16T06:28:49Z)
- DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements.
Layout is employed as an intermediary to bridge large language models and layout-based diffusion models.
We introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
- Exposure Bracketing Is All You Need For A High-Quality Image [50.822601495422916]
Multi-exposure images are complementary in denoising, deblurring, high dynamic range imaging, and super-resolution. We propose to utilize exposure bracketing photography to obtain a high-quality image by combining these tasks in this work. In particular, a temporally modulated recurrent network (TMRNet) and a self-supervised adaptation method are proposed.
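A minimal sketch of recurrent multi-exposure aggregation in the spirit of TMRNet; the ConvGRU-style cell below is a generic stand-in, not the paper's temporally modulated architecture:

```python
import torch
import torch.nn as nn

class RecurrentFusionCell(nn.Module):
    """Generic gated recurrent cell that accumulates exposure information."""
    def __init__(self, ch=16):
        super().__init__()
        self.gate = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)
    def forward(self, x, h):
        z = torch.sigmoid(self.gate(torch.cat([x, h], dim=1)))  # update gate
        c = torch.tanh(self.cand(torch.cat([x, h], dim=1)))     # candidate state
        return (1 - z) * h + z * c

feat = nn.Conv2d(3, 16, 3, padding=1)     # shared per-frame feature extractor
head = nn.Conv2d(16, 3, 3, padding=1)     # reconstruction head

cell = RecurrentFusionCell()
frames = torch.randn(5, 1, 3, 128, 128)   # 5 exposures, short to long
h = torch.zeros(1, 16, 128, 128)
for f in frames:                          # aggregate exposures in capture order
    h = cell(feat(f), h)
fused = head(h)                           # single high-quality output
```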
arXiv Detail & Related papers (2024-01-01T14:14:35Z)
- Multi-task Image Restoration Guided By Robust DINO Features [88.74005987908443]
We propose DINO-IR, a multi-task image restoration approach leveraging robust features extracted from DINOv2.
We first propose a pixel-semantic fusion (PSF) module to dynamically fuse DINOv2's shallow features.
By formulating these modules into a unified deep model, we propose a DINO perception contrastive loss to constrain the model training.
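A speculative sketch of what a pixel-semantic fusion step could look like, with DINOv2 features modulating shallow restoration features FiLM-style; the channel sizes (384 for a ViT-S backbone) and the modulation design are assumptions, not the paper's PSF module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelSemanticFusion(nn.Module):
    """Modulate shallow restoration features with upsampled semantic features."""
    def __init__(self, pix_ch=64, sem_ch=384):
        super().__init__()
        self.to_scale = nn.Conv2d(sem_ch, pix_ch, 1)   # FiLM-style scale
        self.to_shift = nn.Conv2d(sem_ch, pix_ch, 1)   # FiLM-style shift
    def forward(self, pix, sem):
        sem = F.interpolate(sem, size=pix.shape[-2:], mode="bilinear",
                            align_corners=False)       # match pixel resolution
        return pix * (1 + self.to_scale(sem)) + self.to_shift(sem)

pix = torch.randn(1, 64, 128, 128)        # shallow restoration-branch features
sem = torch.randn(1, 384, 16, 16)         # DINOv2 patch tokens as a 16x16 map
fused = PixelSemanticFusion()(pix, sem)   # -> (1, 64, 128, 128)
```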
arXiv Detail & Related papers (2023-12-04T06:59:55Z)
- Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refinement, is devoted to learning the transformation and refinement of the phase spectrum.
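A minimal sketch of the two-stage amplitude/phase idea, with toy networks standing in for MITNet's stages; only the Fourier decomposition itself follows from the summary above:

```python
import torch
import torch.nn as nn

amp_net = nn.Conv2d(3, 3, 3, padding=1)    # stand-in amplitude restorer
pha_net = nn.Conv2d(3, 3, 3, padding=1)    # stand-in phase refiner

hazy = torch.randn(1, 3, 128, 128)
spec = torch.fft.fft2(hazy)                # complex spectrum
amp, pha = spec.abs(), spec.angle()        # amplitude and phase spectra

amp = torch.relu(amp + amp_net(amp))       # stage 1: haze mostly affects the
                                           # amplitude spectrum; keep it >= 0
pha = pha + pha_net(pha)                   # stage 2: refine structural phase
dehazed = torch.fft.ifft2(torch.polar(amp, pha)).real  # back to pixel space
```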
arXiv Detail & Related papers (2023-08-14T08:23:58Z)
- A Task-guided, Implicitly-searched and Meta-initialized Deep Model for Image Fusion [69.10255211811007]
We present a Task-guided, Implicitly-searched and Meta-initialized (TIM) deep model to address the image fusion problem in a challenging real-world scenario.
Specifically, we propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion.
Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency.
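A hedged sketch of task-guided fusion training: an unsupervised fusion term is combined with a downstream task loss so the fused image stays useful for that task. The toy segmentation head, the max-based fusion term, and the weighting are illustrative assumptions, not TIM's actual formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

fuse_net = nn.Conv2d(6, 3, 3, padding=1)      # stand-in fusion network
task_net = nn.Conv2d(3, 2, 3, padding=1)      # stand-in downstream (seg) head

ir, vis = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
labels = torch.randint(0, 2, (1, 64, 64))     # downstream task annotations

fused = fuse_net(torch.cat([ir, vis], dim=1))
# Unsupervised fusion term: stay close to the more informative source pixels.
l_fuse = (fused - torch.maximum(ir, vis)).abs().mean()
# Task-guidance term: the fused image must support the downstream task.
l_task = F.cross_entropy(task_net(fused), labels)
loss = l_fuse + 0.1 * l_task                  # illustrative weighting
loss.backward()
```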
arXiv Detail & Related papers (2023-05-25T08:54:08Z)
- TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning [5.849513679510834]
Image fusion is a technique to integrate complementary information from multiple source images to improve the richness of a single image.
Two-stage methods avoid the need for large amounts of task-specific training data by training an encoder-decoder network on large natural image datasets.
We propose a destruction-reconstruction based self-supervised training scheme to encourage the network to learn task-specific features.
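A minimal sketch of destruction-reconstruction self-supervision on unlabeled natural images; the patch-dropping "destruction" and the toy encoder-decoder are assumptions standing in for TransFuse's actual scheme:

```python
import torch
import torch.nn as nn

def destroy(img, patch=32, drop=0.5):
    """Zero out a random subset of non-overlapping patches."""
    x = img.clone()
    _, _, H, W = x.shape
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            if torch.rand(()) < drop:
                x[:, :, i:i+patch, j:j+patch] = 0.0
    return x

model = nn.Sequential(                      # toy encoder-decoder stand-in
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
img = torch.rand(4, 3, 128, 128)            # natural images, no fusion labels
loss = ((model(destroy(img)) - img) ** 2).mean()  # reconstruct the original
loss.backward()
```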
arXiv Detail & Related papers (2022-01-19T07:30:44Z)
- InfinityGAN: Towards Infinite-Resolution Image Synthesis [92.40782797030977]
We present InfinityGAN, a method to generate arbitrary-resolution images.
We show how it trains and infers patch-by-patch seamlessly with low computational resources.
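A toy sketch of coordinate-conditioned, patch-by-patch synthesis with constant memory per step; the generator and naive stitching below are illustrative, not InfinityGAN's architecture (which also enforces seamless boundaries across patches):

```python
import torch
import torch.nn as nn

class PatchGenerator(nn.Module):
    """Generate one patch from a shared global latent plus its coordinates."""
    def __init__(self, z_dim=64, patch=32):
        super().__init__()
        self.patch = patch
        self.fc = nn.Linear(z_dim + 2, 3 * patch * patch)
    def forward(self, z, coord):                  # coord: normalized (y, x)
        out = self.fc(torch.cat([z, coord], dim=1))
        return out.view(-1, 3, self.patch, self.patch)

g = PatchGenerator()
z = torch.randn(1, 64)                            # one global latent, reused
H = W = 4                                         # 4x4 patches -> 128x128 image
canvas = torch.zeros(1, 3, 32 * H, 32 * W)
for i in range(H):                                # constant memory per step
    for j in range(W):
        coord = torch.tensor([[i / H, j / W]])
        canvas[:, :, 32*i:32*(i+1), 32*j:32*(j+1)] = g(z, coord)
```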
arXiv Detail & Related papers (2021-04-08T17:59:30Z)
- Gated Fusion Network for Degraded Image Super Resolution [78.67168802945069]
We propose a dual-branch convolutional neural network to extract base features and recovered features separately.
By decomposing the feature extraction step into two task-independent streams, the dual-branch model can facilitate the training process.
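A minimal sketch of a gated dual-branch design in this spirit: a base-feature stream and a recovery stream are blended by a learned gate before an upsampling head. The layer choices are assumptions, not the paper's GFN:

```python
import torch
import torch.nn as nn

class GatedDualBranch(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.base = nn.Conv2d(3, ch, 3, padding=1)       # base-feature stream
        self.recover = nn.Sequential(                    # restoration stream
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.gate = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.up = nn.Sequential(                         # x2 upsampling head
            nn.Conv2d(ch, 12, 3, padding=1), nn.PixelShuffle(2),
        )
    def forward(self, x):
        b, r = self.base(x), self.recover(x)
        g = torch.sigmoid(self.gate(torch.cat([b, r], dim=1)))
        return self.up(g * r + (1 - g) * b)              # gated blend, then SR

sr = GatedDualBranch()(torch.randn(1, 3, 64, 64))        # -> (1, 3, 128, 128)
```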
arXiv Detail & Related papers (2020-03-02T13:28:32Z)