Related papers: TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion

URL: http://arxiv.org/abs/2506.16730v1
Date: Fri, 20 Jun 2025 03:53:07 GMT
Title: TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion
Authors: Mingrui Zhu, Xiru Chen, Xin Wei, Nannan Wang, Xinbo Gao
Abstract summary: Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities. We introduce textual semantics at two levels: the mask semantic level and the text semantic level. We propose Textual Semantic Guidance for infrared and visible image fusion, which guides the image synthesis process.
Score: 55.34830989105704
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities, producing more informative and comprehensive outputs. Recently, text-guided IVF has shown great potential due to its flexibility and versatility. However, the effective integration and utilization of textual semantic information remains insufficiently studied. To tackle these challenges, we introduce textual semantics at two levels: the mask semantic level and the text semantic level, both derived from textual descriptions extracted by large Vision-Language Models (VLMs). Building on this, we propose Textual Semantic Guidance for infrared and visible image fusion, termed TeSG, which guides the image synthesis process in a way that is optimized for downstream tasks such as detection and segmentation. Specifically, TeSG consists of three core components: a Semantic Information Generator (SIG), a Mask-Guided Cross-Attention (MGCA) module, and a Text-Driven Attentional Fusion (TDAF) module. The SIG generates mask and text semantics based on textual descriptions. The MGCA module performs initial attention-based fusion of visual features from both infrared and visible images, guided by mask semantics. Finally, the TDAF module refines the fusion process with gated attention driven by text semantics. Extensive experiments demonstrate the competitiveness of our approach, particularly in terms of performance on downstream tasks, compared to existing state-of-the-art methods.
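The two fusion stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the log-mask bias, the residual connection, and the dot-product gate are all assumptions chosen for illustration, and the abstract does not specify any of these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mask_guided_cross_attention(ir, vis, mask, eps=1e-9):
    """Hypothetical stand-in for an MGCA-style step.

    ir, vis: (N, D) token features from the infrared and visible images.
    mask:    (N,) values in [0, 1] marking semantically relevant visible tokens.
    Queries come from infrared tokens, keys and values from visible tokens.
    The mask is added as a log-bias on the keys (an assumption), so tokens
    with mask 0 receive almost no attention.
    """
    d = ir.shape[-1]
    logits = ir @ vis.T / np.sqrt(d)             # (N, N) scaled dot products
    logits = logits + np.log(mask + eps)[None, :]
    attn = softmax(logits, axis=-1)              # each row sums to 1
    fused = ir + attn @ vis                      # residual connection (assumption)
    return fused, attn

def text_gated_fusion(fused, vis, text_emb):
    """Hypothetical stand-in for a TDAF-style step.

    text_emb: (D,) text semantic embedding. A per-token sigmoid gate,
    computed from the similarity between each fused token and the text
    embedding, blends the fused features with the visible features.
    """
    score = fused @ text_emb                     # (N,)
    gate = 1.0 / (1.0 + np.exp(-score))          # sigmoid, values in (0, 1)
    gate = gate[:, None]                         # (N, 1) for broadcasting
    return gate * fused + (1.0 - gate) * vis, gate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ir = rng.normal(size=(6, 4))
    vis = rng.normal(size=(6, 4))
    mask = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
    text_emb = rng.normal(size=4)
    fused, attn = mask_guided_cross_attention(ir, vis, mask)
    out, gate = text_gated_fusion(fused, vis, text_emb)
    print(out.shape)
```

In this sketch the mask acts on attention weights while the text embedding acts on a blending gate, which mirrors the abstract's split between mask-level and text-level guidance only at a conceptual level.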

Related papers

Entity-Guided Multi-Task Learning for Infrared and Visible Image Fusion [14.3937321254743]
We propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). A principled method is proposed to extract entity-level textual information from image captions generated by large vision-language models. A parallel multi-task learning architecture is constructed, which integrates image fusion with a multi-label classification task. An entity-guided cross-modal interactive module is also developed to facilitate the fine-grained interaction between visual and entity-level textual features.
arXiv Detail & Related papers (2026-01-05T08:00:03Z)
Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [56.76198904599581]
Text-to-image diffusion models excel at translating language prompts into implicitly grounding concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over image and text tokens, enabling richer and more scalable cross-modal alignment. We introduce Seg4Diff, a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image.
arXiv Detail & Related papers (2025-09-22T17:59:54Z)
MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion [10.160499805076755]
We introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations. It delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.
arXiv Detail & Related papers (2025-09-16T09:58:06Z)
RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation [4.723262609467585]
Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. Existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-Fusion, a cascaded framework that unifies fusion and RIS through joint optimization.
arXiv Detail & Related papers (2025-09-16T06:03:15Z)
Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework, featuring two innovative components: Attention-Guided Mask (AGM) Modeling and a Text Enrichment Module (TEM). AGA achieves new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z)
EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning [35.87830182497944]
In this paper, we work towards the Entity-centric Image-Text Matching (EITM) problem. The challenge of this task mainly lies in the larger semantic gap in entity association modeling. We devise a multimodal attentive contrastive learning framework to adapt to the EITM problem, developing a model named EntityCLIP.
arXiv Detail & Related papers (2024-10-23T12:12:56Z)
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities. We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding. Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
Image Fusion via Vision-Language Model [91.36809431547128]
We introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM). FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions. These descriptions are fused within the textual domain and guide the visual information fusion. FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion.
arXiv Detail & Related papers (2024-02-03T18:36:39Z)
From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images. Our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion [38.61215361212626]
We propose a text-guided fusion paradigm for advanced image fusion. We release a text-annotated image fusion dataset IVT. Our approach consistently outperforms traditional appearance-based fusion methods.
arXiv Detail & Related papers (2023-12-21T09:25:10Z)
Scene Graph Based Fusion Network For Image-Text Retrieval [2.962083552798791]
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts. We propose a Scene Graph based Fusion Network (dubbed SGFN) which enhances the images'/texts' features through intra- and cross-modal fusion. Our SGFN performs better than quite a few SOTA image-text retrieval methods.
arXiv Detail & Related papers (2023-03-20T13:22:56Z)
Fine-grained Cross-modal Fusion based Refinement for Text-to-Image Synthesis [12.954663420736782]
We propose a novel Fine-grained text-image Fusion based Generative Adversarial Network, dubbed FF-GAN. The FF-GAN consists of two modules: Fine-grained text-image Fusion Block (FF-Block) and Global Semantic Refinement (GSR).
arXiv Detail & Related papers (2023-02-17T05:44:05Z)
DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level. Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.