ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing
- URL: http://arxiv.org/abs/2512.23244v1
- Date: Mon, 29 Dec 2025 06:58:46 GMT
- Title: ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing
- Authors: Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang
- Abstract summary: ViLaCD-R1 is a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). We show that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
- Score: 5.966253859501895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
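The two-stage pipeline described in the abstract (a reasoner that outputs a coarse block-level change mask from dual-temporal patches, followed by a decoder that refines it into a pixel-level binary change map) can be sketched in a toy form. The functions below are illustrative stand-ins, not the paper's method: the actual MIR is a VLM trained with SFT and RL, which is replaced here by a simple block-level intensity difference, and the actual MGD is a learned decoder, replaced here by mask-guided pixel thresholding. All names (`coarse_change_mask`, `refine_mask`) and parameters are hypothetical.

```python
import numpy as np

def coarse_change_mask(img_t1, img_t2, patch=4, thresh=0.2):
    """Stage-1 stand-in for the Multi-Image Reasoner (MIR).

    Compares dual-temporal grayscale images block by block and flags
    blocks whose mean absolute difference exceeds a threshold, yielding
    the same kind of coarse, block-level mask the MIR produces.
    """
    h, w = img_t1.shape
    mask = np.zeros((h // patch, w // patch), dtype=bool)
    for i in range(h // patch):
        for j in range(w // patch):
            b1 = img_t1[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            b2 = img_t2[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            mask[i, j] = np.abs(b1 - b2).mean() > thresh
    return mask

def refine_mask(img_t1, img_t2, coarse, patch=4, thresh=0.2):
    """Stage-2 stand-in for the Mask-Guided Decoder (MGD).

    Refines the coarse block mask into a pixel-level binary change map,
    keeping only pixel differences inside blocks the coarse mask flagged,
    so spurious per-pixel noise outside candidate regions is suppressed.
    """
    diff = np.abs(img_t1 - img_t2) > thresh
    # Upsample the block mask to pixel resolution and gate the diff map.
    guide = np.kron(coarse, np.ones((patch, patch), dtype=bool))
    return diff & guide
```

The gating in `refine_mask` mirrors the paper's motivation: the coarse semantic stage decides *where* change is plausible, and the refinement stage only delineates boundaries within those regions, rather than trusting raw pixel differences everywhere.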
Related papers
- Referring Change Detection in Remote Sensing Imagery [49.841833753558575]
We introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. We propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion-based synthetic data generation pipeline.
arXiv Detail & Related papers (2025-12-12T16:57:12Z) - LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation [9.324344835427858]
We propose a novel Language-Guided Change Detection model (LG-CD). This model leverages natural language prompts to direct the network's attention to regions of interest. Our experiments on three datasets demonstrate that LG-CD consistently outperforms state-of-the-art change detection methods.
arXiv Detail & Related papers (2025-09-26T05:30:11Z) - Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection [36.96267014127019]
MMChange is a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. To overcome the semantic limitations of image features, we employ a vision language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module captures fine-grained semantic shifts, guiding the model toward meaningful changes.
arXiv Detail & Related papers (2025-09-04T07:39:18Z) - MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - S2C: Learning Noise-Resistant Differences for Unsupervised Change Detection in Multimodal Remote Sensing Images [24.75086641416994]
Unsupervised Change Detection (UCD) in multimodal Remote Sensing (RS) images remains a difficult challenge. Inspired by recent advancements in Visual Foundation Models (VFMs) and Contrastive Learning (CL) methodologies, this research aims to develop methods that translate the implicit knowledge in such representations into change predictions.
arXiv Detail & Related papers (2025-02-18T07:34:54Z) - Semantic-CD: Remote Sensing Image Semantic Change Detection towards Open-vocabulary Setting [19.663899648983417]
Traditional change detection methods often face challenges in generalizing across semantic categories in practical scenarios. We introduce a novel approach called Semantic-CD, specifically designed for semantic change detection in remote sensing images. By utilizing CLIP's extensive vocabulary knowledge, our model enhances its ability to generalize across categories.
arXiv Detail & Related papers (2025-01-12T13:22:11Z) - Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers [58.80845404416028]
Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. We propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs.
arXiv Detail & Related papers (2024-12-21T09:30:45Z) - Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z) - Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance [19.663899648983417]
We introduce a novel change captioning (CC) method based on the foundational knowledge and semantic guidance.
We validate the proposed method on the LEVIR-CC and LEVIR-CD datasets.
arXiv Detail & Related papers (2024-07-19T05:07:41Z) - TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images [64.63004710817239]
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD.
It improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner.
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.