Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning
- URL: http://arxiv.org/abs/2409.12612v1
- Date: Thu, 19 Sep 2024 09:33:33 GMT
- Title: Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning
- Authors: Cong Yang, Zuchao Li, Hongzan Jiao, Zhi Gao, Lefei Zhang
- Abstract summary: We propose a novel framework for remote sensing image change captioning guided by key change features and instruction tuning (KCFI).
KCFI includes a ViT encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, a pixel-level change detection decoder, and an instruction-tuned decoder based on a large language model.
To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, while significant progress has been made in remote sensing image change captioning, existing methods fail to filter out areas unrelated to actual changes, leaving models susceptible to irrelevant features. In this article, we propose a novel multimodal framework for remote sensing image change captioning guided by key change features and instruction tuning (KCFI). This framework aims to fully leverage the intrinsic knowledge of large language models through visual instructions and to enhance the effectiveness and accuracy of change features using a pixel-level change detection task. Specifically, KCFI includes a ViT encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, a pixel-level change detection decoder to constrain key change features, and an instruction-tuned decoder based on a large language model. Moreover, to ensure that the change description and change detection tasks are jointly optimized, we employ a dynamic weight-averaging strategy to balance the losses between the two tasks. We also explore various feature combinations for visual fine-tuning instructions and demonstrate that using only key change features to guide the large language model is the optimal choice. To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset, achieving the best performance. Our code will be available at https://github.com/yangcong356/KCFI.git.
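The dynamic weight-averaging strategy is only named in the abstract, not specified. As a minimal sketch, assuming the common Dynamic Weight Average recipe of Liu et al. (2019), each task's weight tracks the ratio of its two most recent epoch losses, so a task whose loss has stopped falling is up-weighted; the class name, temperature, and placeholder losses below are illustrative, not taken from the paper.

```python
import math
from collections import deque


class DynamicWeightAverage:
    """Balance task losses via Dynamic Weight Averaging (DWA).

    Hedged sketch: KCFI reports a dynamic weight-averaging strategy
    without giving the formula here; this follows the standard DWA
    recipe (Liu et al., 2019)."""

    def __init__(self, num_tasks: int = 2, temperature: float = 2.0):
        self.num_tasks = num_tasks
        self.temperature = temperature
        # Keep the two most recent epoch-level losses per task.
        self.history = [deque(maxlen=2) for _ in range(num_tasks)]

    def weights(self) -> list[float]:
        # Until two epochs of history exist, weight tasks equally.
        if any(len(h) < 2 for h in self.history):
            return [1.0] * self.num_tasks
        # r_k = L_k(t-1) / L_k(t-2): a slowly decreasing loss -> larger weight.
        ratios = [h[-1] / h[-2] for h in self.history]
        exps = [math.exp(r / self.temperature) for r in ratios]
        total = sum(exps)
        return [self.num_tasks * e / total for e in exps]

    def record(self, epoch_losses: list[float]) -> None:
        # Call once per epoch with each task's mean loss.
        for h, loss in zip(self.history, epoch_losses):
            h.append(loss)


# Usage with the two KCFI tasks: captioning and pixel-level detection.
dwa = DynamicWeightAverage(num_tasks=2)
for epoch in range(3):
    caption_loss, detect_loss = 1.0 / (epoch + 1), 0.5  # placeholders
    w_cap, w_det = dwa.weights()
    total_loss = w_cap * caption_loss + w_det * detect_loss
    dwa.record([caption_loss, detect_loss])
```

With a temperature around 2, the weights stay near uniform unless one task's loss plateaus while the other's keeps improving, which is the balancing behavior the abstract asks of the strategy.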
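The abstract also reports that conditioning the language model on key change features alone is the best choice, without detailing the interface here. A plausible sketch, assuming the common LLaVA-style pattern (all names and dimensions below are hypothetical): project the key change features into the LLM's embedding space and prepend them to the text embeddings as visual instruction tokens.

```python
import torch
import torch.nn as nn


class KeyChangeFeaturePrefix(nn.Module):
    """Hypothetical adapter: feed key change features to an LLM decoder
    as visual instruction tokens, LLaVA-style. Not the paper's code."""

    def __init__(self, vis_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Linear projection from visual feature space to LLM embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, key_change_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # key_change_feats: (batch, n_vis_tokens, vis_dim)
        # text_embeds:      (batch, n_txt_tokens, llm_dim)
        vis_tokens = self.proj(key_change_feats)
        # The decoder then attends jointly over [visual | text] tokens.
        return torch.cat([vis_tokens, text_embeds], dim=1)


prefix = KeyChangeFeaturePrefix()
fused = prefix(torch.randn(2, 16, 768), torch.randn(2, 32, 4096))
print(fused.shape)  # torch.Size([2, 48, 4096])
```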
Related papers
- ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning
We introduce ChangeChat, the first bitemporal vision-language model (VLM) designed specifically for RS change analysis.
ChangeChat utilizes multimodal instruction tuning, allowing it to handle complex queries such as change captioning, category-specific quantification, and change localization.
Experiments show that ChangeChat offers a comprehensive, interactive solution for RS change analysis, achieving performance comparable to or even better than state-of-the-art (SOTA) methods on specific tasks.
arXiv Detail & Related papers (2024-09-13T07:00:44Z)
- Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance
We introduce a novel change captioning (CC) method based on foundational knowledge and semantic guidance.
We validate the proposed method on the LEVIR-CC and LEVIR-CD datasets.
arXiv Detail & Related papers (2024-07-19T05:07:41Z)
- ChangeViT: Unleashing Plain Vision Transformers for Change Detection
ChangeViT is a framework that adopts a plain ViT backbone to improve the detection of large-scale changes.
The framework achieves state-of-the-art performance on three popular high-resolution datasets.
arXiv Detail & Related papers (2024-06-18T17:59:08Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations
We propose a memory-supported transformer (MS-Former) for weakly supervised change detection.
MS-Former consists of a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS).
Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method in the change detection task.
arXiv Detail & Related papers (2023-11-16T09:57:29Z)
- TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD.
It improves feature extraction from a global view and combines multi-level visual features in a pyramidal manner.
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z)
- VcT: Visual change Transformer for Remote Sensing Image Change Detection
We propose a novel Visual change Transformer (VcT) model for the visual change detection problem.
Top-K reliable tokens are mined from the map and refined using a clustering algorithm.
Extensive experiments on multiple benchmark datasets validate the effectiveness of our proposed VcT model.
arXiv Detail & Related papers (2023-10-17T17:25:31Z)
- Changes-Aware Transformer: Learning Generalized Changes Representation
We propose a novel Changes-Aware Transformer (CAT) for refining difference features.
The generalized representation of various changes is learned directly in the difference feature space.
After refinement, the changed pixels in the difference feature space are closer to each other, which facilitates change detection.
arXiv Detail & Related papers (2023-09-24T12:21:57Z)
- Changes to Captions: An Attentive Network for Remote Sensing Change Captioning
This study highlights the significance of accurately describing changes in remote sensing images.
We propose an attentive changes-to-captions network, called Chg2Cap for short, for bi-temporal remote sensing images.
The proposed Chg2Cap network is evaluated on two representative remote sensing datasets.
arXiv Detail & Related papers (2023-04-03T15:51:42Z)
- Neighborhood Contrastive Transformer for Change Captioning
We propose a neighborhood contrastive transformer to improve the model's ability to perceive various changes across different scenes.
The proposed method achieves the state-of-the-art performance on three public datasets with different change scenarios.
arXiv Detail & Related papers (2023-03-06T14:39:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.