Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection
- URL: http://arxiv.org/abs/2509.03961v1
- Date: Thu, 04 Sep 2025 07:39:18 GMT
- Title: Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection
- Authors: Yijun Zhou, Yikui Zhai, Zilu Ying, Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Xiaolin Tian, Xudong Jia, Hongsheng Zhang, C. L. Philip Chen
- Abstract summary: MMChange is a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. To overcome the semantic limitations of image features, we employ a vision-language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module captures fine-grained semantic shifts, guiding the model toward meaningful changes.
- Score: 36.96267014127019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on image modality, limiting feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision-language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine-grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image-Text Feature Fusion (ITFF) module that enables deep cross-modal integration. Extensive experiments on LEVIR-CD, WHU-CD, and SYSU-CD demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.
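The abstract names three modules (IFR, TDE, ITFF) but gives no implementation detail. As a toy illustration only, text-difference-guided gating of bitemporal image-difference features might be sketched as follows; the function names, the embeddings, and the sigmoid-gating scheme are assumptions for illustration, not the authors' code:

```python
import math

def text_difference(t1, t2):
    """Toy stand-in for the TDE idea: elementwise absolute
    difference between two text embeddings (lists of floats)."""
    return [abs(a - b) for a, b in zip(t1, t2)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(img_diff, txt_diff):
    """Toy stand-in for ITFF: gate each image-difference channel
    by a sigmoid of the corresponding text-difference value."""
    return [d * sigmoid(g) for d, g in zip(img_diff, txt_diff)]

# Bitemporal image features and text embeddings (made-up numbers).
f1, f2 = [0.9, 0.2, 0.5], [0.1, 0.2, 0.8]
t1, t2 = [1.0, 0.0, 0.4], [1.0, 1.0, 0.1]

img_diff = [abs(a - b) for a, b in zip(f1, f2)]
txt_diff = text_difference(t1, t2)
fused = fuse(img_diff, txt_diff)
print([round(v, 3) for v in fused])  # → [0.4, 0.0, 0.172]
```

Here the text-difference vector decides how strongly each image-difference channel contributes, loosely mirroring the idea that semantic shifts in the descriptions steer the model toward meaningful changes.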
Related papers
- Referring Change Detection in Remote Sensing Imagery [49.841833753558575]
We introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. We propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion-based synthetic data generation pipeline.
arXiv Detail & Related papers (2025-12-12T16:57:12Z) - LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation [9.324344835427858]
We propose a novel Language-Guided Change Detection model (LG-CD). This model leverages natural language prompts to direct the network's attention to regions of interest. Our experiments on three datasets demonstrate that LG-CD consistently outperforms state-of-the-art change detection methods.
arXiv Detail & Related papers (2025-09-26T05:30:11Z) - MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR-Net achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z) - OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model [8.619958921346184]
Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. We propose OSDM-MReg, a novel multimodal image registration framework based on image-to-image translation. Experiments demonstrate superior accuracy and efficiency across various multimodal registration tasks.
arXiv Detail & Related papers (2025-04-08T13:32:56Z) - LDGNet: A Lightweight Difference Guiding Network for Remote Sensing Change Detection [6.554696547472252]
We propose a Lightweight Difference Guiding Network (LDGNet) to guide optical remote sensing change detection. First, to enhance the feature representation capability of the lightweight backbone network, we propose the Difference Guiding Module (DGM). Second, we propose the Difference-Aware Dynamic Fusion (DADF) module with a Visual State Space Model (VSSM) for lightweight long-range dependency modeling.
arXiv Detail & Related papers (2025-04-07T13:33:54Z) - Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation [7.992331117310217]
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation. We design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities.
arXiv Detail & Related papers (2025-03-14T08:31:21Z) - Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z) - Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning [49.24306593078429]
We propose a novel framework for remote sensing image change captioning, guided by key change features and instruction tuning (KCFI).
KCFI includes a ViTs encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, and a pixel-level change detection decoder.
To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset.
arXiv Detail & Related papers (2024-09-19T09:33:33Z) - Cross-Domain Separable Translation Network for Multimodal Image Change Detection [11.25422609271201]
Multimodal change detection (MCD) is particularly critical in the remote sensing community.
This paper focuses on addressing the challenges of MCD, especially the difficulty in comparing images from different sensors.
A novel unsupervised cross-domain separable translation network (CSTN) is proposed to overcome these limitations.
arXiv Detail & Related papers (2024-07-23T03:56:02Z) - TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images [64.63004710817239]
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD.
It improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner.
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z) - Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
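As background on the wavelet step described above (not the UFAFormer implementation), a single-level 2-D Haar transform that splits an image into the LL, LH, HL, and HH frequency sub-bands can be sketched in plain Python:

```python
def haar_1d(row):
    """One level of the 1-D Haar transform: pairwise averages
    (low-pass) and pairwise differences (high-pass)."""
    low = [(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)]
    high = [(row[i] - row[i + 1]) / 2 for i in range(0, len(row), 2)]
    return low, high

def haar_2d(img):
    """Single-level 2-D Haar DWT: transform rows first, then
    columns, yielding the LL, LH, HL, HH sub-bands."""
    lows, highs = [], []
    for row in img:
        lo, hi = haar_1d(row)
        lows.append(lo)
        highs.append(hi)

    def cols(mat):
        lo_rows, hi_rows = [], []
        for col in zip(*mat):
            lo, hi = haar_1d(list(col))
            lo_rows.append(lo)
            hi_rows.append(hi)
        # Transpose back to row-major order.
        return ([list(r) for r in zip(*lo_rows)],
                [list(r) for r in zip(*hi_rows)])

    ll, lh = cols(lows)
    hl, hh = cols(highs)
    return ll, lh, hl, hh

# A flat 4x4 image: all detail (high-frequency) bands should be zero.
img = [[5.0] * 4 for _ in range(4)]
ll, lh, hl, hh = haar_2d(img)
print(ll)  # → [[5.0, 5.0], [5.0, 5.0]]
print(hh)  # → [[0.0, 0.0], [0.0, 0.0]]
```

UFAFormer would feed such sub-bands into its frequency encoder; real systems typically use a wavelet library and several decomposition levels rather than this hand-rolled single level.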
arXiv Detail & Related papers (2023-09-18T11:06:42Z) - Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require a large number of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
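The "edge-aware" ingredient above suggests edge maps condition the pre-training. As a generic illustration (not the MT-Net recipe), a Sobel gradient-magnitude edge map can be computed as follows:

```python
def sobel_edges(img):
    """Toy edge map: Sobel gradient magnitude, interior pixels only
    (borders are left at zero)."""
    gx_k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gy_k = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(gx_k[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(gy_k[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# A vertical step edge: left half 0, right half 1.
img = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_edges(img)
print(edges[1])  # → [0.0, 4.0, 4.0, 0.0]
```

A real pipeline would use a vision library and likely a learned edge detector, but the idea is the same: highlight anatomical boundaries so the synthesis model is pushed to preserve them.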
arXiv Detail & Related papers (2022-12-02T11:40:40Z) - Fully Transformer Network for Change Detection of Remote Sensing Images [22.989324947501014]
We propose a novel learning framework named Fully Transformer Network (FTN) for remote sensing image CD.
It improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner.
Our proposed method achieves a new state-of-the-art performance on four public CD benchmarks.
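The "pyramid manner" of combining multi-level features mentioned here (and in TransY-Net above) is, generically, an FPN-style coarse-to-fine merge. A 1-D toy sketch under that assumption, not the FTN code:

```python
def upsample2(f):
    """Nearest-neighbor 2x upsampling of a 1-D feature list."""
    return [v for v in f for _ in (0, 1)]

def pyramid_combine(levels):
    """Combine multi-level features coarse-to-fine: repeatedly
    upsample the running sum and add the next finer level."""
    acc = levels[-1]  # start from the coarsest level
    for feat in reversed(levels[:-1]):
        acc = [a + b for a, b in zip(upsample2(acc), feat)]
    return acc

fine = [1, 1, 1, 1]
mid = [2, 2]
coarse = [3]
print(pyramid_combine([fine, mid, coarse]))  # → [6, 6, 6, 6]
```

In 2-D transformer CD networks the same pattern applies per feature map, with learned projections instead of plain addition.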
arXiv Detail & Related papers (2022-10-03T08:21:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences arising from its use.