MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion
- URL: http://arxiv.org/abs/2509.12901v1
- Date: Tue, 16 Sep 2025 09:58:06 GMT
- Title: MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion
- Authors: Guihui Li, Bowei Dong, Kaizhi Dong, Jiayi Li, Haiyong Zheng
- Abstract summary: We introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations. It delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.
- Score: 10.160499805076755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Infrared and visible image fusion has garnered considerable attention owing to the strong complementarity of these two modalities in complex, harsh environments. While deep learning-based fusion methods have made remarkable advances in feature extraction, alignment, fusion, and reconstruction, they still depend largely on low-level visual cues, such as texture and contrast, and struggle to capture the high-level semantic information embedded in images. Recent attempts to incorporate text as a source of semantic guidance have relied on unstructured descriptions that neither explicitly model entities, attributes, and relationships nor provide spatial localization, thereby limiting fine-grained fusion performance. To overcome these challenges, we introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations, and then synchronously refines high-level semantics and low-level details through successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion. Extensive experiments on multiple public benchmarks show that MSGFusion significantly outperforms state-of-the-art approaches, particularly in detail preservation and structural clarity, and delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.
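The abstract describes scene graphs that explicitly represent entities, attributes, and spatial relations. As a rough illustration of that kind of structure only, here is a minimal Python sketch. It is not the MSGFusion implementation: the class names `Entity` and `SceneGraph`, the methods `add_entity`, `add_relation`, and `describe`, the bounding-box convention, and the example scene are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Entity:
    """A node in the graph: an object with attributes and an optional location."""
    name: str
    attributes: List[str] = field(default_factory=list)
    # Assumed convention: (x0, y0, x1, y1) pixel box, or None if not localized.
    box: Optional[Tuple[int, int, int, int]] = None


@dataclass
class SceneGraph:
    """Entities keyed by name, plus (subject, predicate, object) relation triples."""
    entities: Dict[str, Entity] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, name, attributes=(), box=None):
        self.entities[name] = Entity(name, list(attributes), box)

    def add_relation(self, subject, predicate, obj):
        # Both endpoints must already be nodes in the graph.
        for node in (subject, obj):
            if node not in self.entities:
                raise KeyError(f"unknown entity: {node}")
        self.relations.append((subject, predicate, obj))

    def _phrase(self, name):
        # Attributes first, then the entity name, e.g. "warm person".
        return " ".join(self.entities[name].attributes + [name])

    def describe(self):
        # One readable string per relation triple.
        return [
            f"{self._phrase(s)} {p} {self._phrase(o)}"
            for s, p, o in self.relations
        ]


# Hypothetical scene: a person (bright in infrared) standing on a dark road.
graph = SceneGraph()
graph.add_entity("person", ["warm"], box=(40, 60, 80, 180))
graph.add_entity("road", ["dark"])
graph.add_relation("person", "standing on", "road")
print(graph.describe())
```

Unlike a free-form caption, a structure of this kind keeps each entity's attributes and location separate from the relations between entities, which is the property the abstract contrasts with unstructured text descriptions.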
Related papers
- Entity-Guided Multi-Task Learning for Infrared and Visible Image Fusion [14.3937321254743]
We propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). A principled method is proposed to extract entity-level textual information from image captions generated by large vision-language models. A parallel multi-task learning architecture is constructed, which integrates image fusion with a multi-label classification task. An entity-guided cross-modal interactive module is also developed to facilitate the fine-grained interaction between visual and entity-level textual features.
arXiv Detail & Related papers (2026-01-05T08:00:03Z) - Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z) - Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [56.76198904599581]
Text-to-image diffusion models excel at translating language prompts into images by implicitly grounding concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. We introduce Seg4Diff, a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image.
arXiv Detail & Related papers (2025-09-22T17:59:54Z) - Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision [8.898422193366354]
We propose a novel end-to-end fusion framework that explicitly models inter-modality interaction and enhances task-critical regions. FusionNet introduces a modality-aware attention mechanism that dynamically adjusts the contribution of infrared and visible features. Experiments on the public M3FD dataset demonstrate that FusionNet generates fused images with enhanced semantic preservation, high perceptual quality, and clear interpretability.
arXiv Detail & Related papers (2025-09-14T23:44:15Z) - TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion [55.34830989105704]
Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities. We introduce textual semantics at two levels: the mask semantic level and the text semantic level. We propose Textual Semantic Guidance for infrared and visible image fusion, which guides the image synthesis process.
arXiv Detail & Related papers (2025-06-20T03:53:07Z) - DFVO: Learning Darkness-free Visible and Infrared Image Disentanglement and Fusion All at Once [57.15043822199561]
A Darkness-Free network is proposed to handle Visible and infrared image disentanglement and fusion all at Once (DFVO). DFVO employs a cascaded multi-task approach to replace the traditional two-stage cascaded training (enhancement and fusion). Our proposed approach outperforms state-of-the-art alternatives in terms of qualitative and quantitative evaluations.
arXiv Detail & Related papers (2025-05-07T15:59:45Z) - UMCFuse: A Unified Multiple Complex Scenes Infrared and Visible Image Fusion Framework [18.30261731071375]
We propose a unified framework for infrared and visible image fusion in complex scenes, termed UMCFuse. We classify the pixels of visible images according to the degree of light transmission scattering, allowing us to separate fine details from overall intensity.
arXiv Detail & Related papers (2024-02-03T09:27:33Z) - From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images. Our method not only produces visually superior fusion results but also achieves a higher detection mAP than existing methods, reaching state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z) - CoCoNet: Coupled Contrastive Learning Network with Multi-level Feature Ensemble for Multi-modality Image Fusion [68.78897015832113]
We propose a coupled contrastive learning network, dubbed CoCoNet, to realize infrared and visible image fusion. Our method achieves state-of-the-art (SOTA) performance under both subjective and objective evaluation.
arXiv Detail & Related papers (2022-11-20T12:02:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.