TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion
- URL: http://arxiv.org/abs/2312.14209v2
- Date: Thu, 8 Feb 2024 11:43:57 GMT
- Title: TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion
- Authors: Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Hui Li, Xi Li, Zhangyong Tang, Josef Kittler
- Abstract summary: We propose a text-guided fusion paradigm for advanced image fusion.
We release a text-annotated image fusion dataset IVT.
Our approach consistently outperforms traditional appearance-based fusion methods.
- Score: 38.61215361212626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advanced image fusion methods are devoted to generating the fusion results by
aggregating the complementary information conveyed by the source images.
However, the difference in the source-specific manifestation of the imaged
scene content makes it difficult to design a robust and controllable fusion
process. We argue that this issue can be alleviated with the help of
higher-level semantics, conveyed by the text modality, which should enable us
to generate fused images for different purposes, such as visualisation and
downstream tasks, in a controllable way. This is achieved by exploiting a
vision-and-language model to build a coarse-to-fine association mechanism
between the text and image signals. With the guidance of the association maps,
an affine fusion unit is embedded in the transformer network to fuse the text
and vision modalities at the feature level. As another ingredient of this work,
we propose the use of textual attention to adapt image quality assessment to
the fusion task. To facilitate the implementation of the proposed text-guided
fusion paradigm, and its adoption by the wider research community, we release a
text-annotated image fusion dataset IVT. Extensive experiments demonstrate that
our approach (TextFusion) consistently outperforms traditional appearance-based
fusion methods. Our code and dataset will be publicly available at
https://github.com/AWCXV/TextFusion.
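The abstract describes an affine fusion unit that injects text semantics into the visual features under the guidance of text-image association maps. Below is a minimal, hypothetical PyTorch sketch of that idea: a pooled text embedding predicts per-channel scale and shift parameters, and a coarse-to-fine association map controls where the modulation is applied. All module and tensor names are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a text-conditioned affine fusion unit.
# Not the official TextFusion code; names and shapes are assumptions.
import torch
import torch.nn as nn

class AffineFusionUnit(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        # Project the pooled text embedding to per-channel scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(txt_dim, vis_dim)
        self.to_beta = nn.Linear(txt_dim, vis_dim)

    def forward(self, vis_feat, txt_emb, assoc_map):
        # vis_feat:  (B, C, H, W)  visual features from the transformer backbone
        # txt_emb:   (B, D)        pooled text embedding (e.g. from a CLIP-style text encoder)
        # assoc_map: (B, 1, H, W)  coarse-to-fine text-image association map in [0, 1]
        gamma = self.to_gamma(txt_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(txt_emb).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        modulated = gamma * vis_feat + beta
        # Apply the text-conditioned affine modulation only where the association is strong.
        return assoc_map * modulated + (1.0 - assoc_map) * vis_feat

# Toy usage with random tensors.
unit = AffineFusionUnit(vis_dim=64, txt_dim=512)
vis = torch.randn(2, 64, 32, 32)
txt = torch.randn(2, 512)
assoc = torch.sigmoid(torch.randn(2, 1, 32, 32))
fused = unit(vis, txt, assoc)
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

Under these assumptions, regions highlighted by the association map are reshaped according to the text prompt while the remaining regions keep their original appearance features, which is one plausible way to realise the controllable fusion behaviour the abstract claims.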
Related papers
- Text-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model [30.739879255847946]
Existing multi-modal image fusion methods fail to address the compound degradations present in the source images.
This study proposes a novel interactive multi-modal image fusion framework based on the text-modulated diffusion model, called Text-DiFuse.
arXiv Detail & Related papers (2024-10-31T13:10:50Z)
- Fusion from Decomposition: A Self-Supervised Approach for Image Fusion and Beyond [74.96466744512992]
The essence of image fusion is to integrate complementary information from source images.
DeFusion++ produces versatile fused representations that can enhance the quality of image fusion and the effectiveness of downstream high-level vision tasks.
arXiv Detail & Related papers (2024-10-16T06:28:49Z)
- An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models [18.184158874126545]
We investigate how different fusion strategies can affect vision-language alignment.
A specially designed intermediate fusion can boost text-to-image alignment with improved generation quality.
Our model achieves a higher CLIP Score and lower FID, with 20% fewer FLOPs and 50% faster training.
arXiv Detail & Related papers (2024-03-25T08:16:06Z)
- Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion [26.809259323430368]
We introduce a novel approach, termed Text-IF, that leverages a semantic text-guided image fusion model for the degradation-aware and interactive image fusion task.
Text-IF supports all-in-one degradation-aware processing of infrared and visible images and delivers interactive, flexible fusion outcomes.
In this way, Text-IF achieves not only multi-modal image fusion, but also multi-modal information fusion.
arXiv Detail & Related papers (2024-03-25T03:06:45Z)
- Image Fusion via Vision-Language Model [91.36809431547128]
We introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM).
FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions.
These descriptions are fused within the textual domain and guide the visual information fusion.
FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion.
arXiv Detail & Related papers (2024-02-03T18:36:39Z)
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- A Task-guided, Implicitly-searched and Meta-initialized Deep Model for Image Fusion [69.10255211811007]
We present a Task-guided, Implicitly-searched and Meta-initialized (TIM) deep model to address the image fusion problem in a challenging real-world scenario.
Specifically, we propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion.
Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency.
arXiv Detail & Related papers (2023-05-25T08:54:08Z)
- Scene Graph Based Fusion Network For Image-Text Retrieval [2.962083552798791]
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts.
We propose a Scene Graph based Fusion Network (dubbed SGFN) which enhances the images'/texts' features through intra- and cross-modal fusion.
Our SGFN outperforms a number of state-of-the-art image-text retrieval methods.
arXiv Detail & Related papers (2023-03-20T13:22:56Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)