MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection
- URL: http://arxiv.org/abs/2508.01555v1
- Date: Sun, 03 Aug 2025 02:50:08 GMT
- Title: MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection
- Authors: Chengming Wang, Guodong Fan, Jinjiang Li, Min Gan, C. L. Philip Chen
- Abstract summary: We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
- Score: 55.702662643521265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advancement of remote sensing satellite technology and the rapid progress of deep learning, remote sensing change detection (RSCD) has become a key technique for regional monitoring. Traditional change detection (CD) methods and deep learning-based approaches have made significant contributions to change analysis and detection; however, many outstanding methods still face limitations in the exploration and application of multimodal data. To address this, we propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to further explore the semantic interaction capabilities of multimodal data. Multimodal large language models (MLLMs) have attracted widespread attention for their outstanding performance in computer vision, particularly due to their powerful visual-language understanding and dialogic interaction capabilities. Specifically, we design an MLLM-based optimization strategy to generate multimodal textual data from the original CD images, which serve as textual input to MGCR. Visual and textual features are extracted through a dual-encoder framework. For the first time in the RSCD task, we introduce a multimodal graph-conditioned vision-language reconstruction mechanism, integrated with graph attention to construct a semantic graph-conditioned reconstruction module (SGCM). This module generates vision-language (VL) tokens through graph-based conditions and enables cross-dimensional interaction between visual and textual features via multihead attention. The reconstructed VL features are then deeply fused using the language vision transformer (LViT), achieving fine-grained feature alignment and high-level semantic interaction. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods. Our code is available at https://github.com/cn-xvkong/MGCR
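The SGCM is only sketched at a high level in the abstract. The snippet below is a minimal, hypothetical PyTorch illustration of the general idea: graph attention over a joint set of visual and textual tokens produces graph-conditioned VL tokens, and multihead attention then lets the two modalities interact. The class names, the k-nearest-neighbour cosine-similarity graph, the token shapes, and all hyperparameters are assumptions for illustration only; they are not taken from the authors' released code.

```python
# Hypothetical sketch of a graph-conditioned vision-language reconstruction
# block. Shapes, graph construction, and names are illustrative assumptions,
# not the MGCR-Net implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head, GAT-style attention over token nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) token features, adj: (B, N, N) binary adjacency
        h = self.proj(x)
        B, N, D = h.shape
        hi = h.unsqueeze(2).expand(B, N, N, D)
        hj = h.unsqueeze(1).expand(B, N, N, D)
        e = self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1)  # pairwise logits
        e = e.masked_fill(adj == 0, float("-inf"))               # keep graph edges only
        alpha = F.softmax(e, dim=-1)
        return F.elu(torch.bmm(alpha, h))                        # aggregate neighbours


class SemanticGraphConditionedReconstruction(nn.Module):
    """SGCM-like block (assumed design): graph attention builds conditioned
    VL tokens, then multihead attention fuses visual and textual parts."""

    def __init__(self, dim: int = 256, heads: int = 8, knn: int = 8):
        super().__init__()
        self.knn = knn
        self.gat = GraphAttentionLayer(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def build_adjacency(self, tokens: torch.Tensor) -> torch.Tensor:
        # Assumed graph condition: connect each token to its k most similar
        # tokens under cosine similarity (an illustrative choice).
        t = F.normalize(tokens, dim=-1)
        sim = t @ t.transpose(1, 2)
        idx = sim.topk(self.knn, dim=-1).indices
        return torch.zeros_like(sim).scatter_(-1, idx, 1.0)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, Nv, D) visual tokens, txt: (B, Nt, D) textual tokens
        tokens = torch.cat([vis, txt], dim=1)       # joint VL node set
        vl_tokens = self.gat(tokens, self.build_adjacency(tokens))
        q = vl_tokens[:, : vis.size(1)]             # visual part queries ...
        kv = vl_tokens[:, vis.size(1):]             # ... the textual part
        fused, _ = self.cross_attn(q, kv, kv)
        return self.norm(q + fused)                 # reconstructed VL features


if __name__ == "__main__":
    sgcm = SemanticGraphConditionedReconstruction(dim=256)
    vis = torch.randn(2, 64, 256)   # e.g. flattened bi-temporal image tokens
    txt = torch.randn(2, 16, 256)   # e.g. MLLM-generated caption tokens
    print(sgcm(vis, txt).shape)     # torch.Size([2, 64, 256])
```

In this sketch the reconstructed features would then feed a fusion stage (LViT in the paper) and a change-detection head; both are omitted here since the abstract gives no further detail.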
Related papers
- True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL), i.e., adapting to new tasks. Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
arXiv Detail & Related papers (2025-07-21T17:08:18Z)
- Unified Multimodal Understanding via Byte-Pair Visual Encoding [34.96534298857146]
Multimodal large language models (MLLMs) have made significant progress in vision-language understanding. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens.
arXiv Detail & Related papers (2025-06-30T09:08:08Z)
- Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation [7.992331117310217]
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation. We design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities.
arXiv Detail & Related papers (2025-03-14T08:31:21Z)
- RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities. RSUniVLM performs effectively in multi-image analysis, with instances of change detection and change captioning. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains.
arXiv Detail & Related papers (2024-12-07T15:11:21Z)
- From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing [16.755590790629153]
This review examines the development and application of multi-modal language models (MLLMs) in remote sensing.
We focus on their ability to interpret and describe satellite imagery using natural language.
Key applications such as scene description, object detection, change detection, text-to-image retrieval, image-to-text generation, and visual question answering are discussed.
arXiv Detail & Related papers (2024-11-05T12:14:22Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Dynamic Context-guided Capsule Network for Multimodal Machine Translation [131.37130887834667]
Multimodal machine translation (MMT) mainly focuses on enhancing text-only translation with visual features.
We propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT.
Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN.
arXiv Detail & Related papers (2020-09-04T06:18:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.