Changes to Captions: An Attentive Network for Remote Sensing Change
Captioning
- URL: http://arxiv.org/abs/2304.01091v2
- Date: Thu, 26 Oct 2023 09:37:16 GMT
- Title: Changes to Captions: An Attentive Network for Remote Sensing Change
Captioning
- Authors: Shizhen Chang and Pedram Ghamisi
- Abstract summary: This study highlights the significance of accurately describing changes in remote sensing images.
We propose an attentive changes-to-captions network, called Chg2Cap for short, for bi-temporal remote sensing images.
The proposed Chg2Cap network is evaluated on two representative remote sensing datasets.
- Score: 15.986576036345333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, advanced research has focused on the direct learning and
analysis of remote sensing images using natural language processing (NLP)
techniques. The ability to accurately describe changes occurring in
multi-temporal remote sensing images is becoming increasingly important for
geospatial understanding and land planning. Unlike natural image change
captioning tasks, remote sensing change captioning aims to capture the most
significant changes, irrespective of various influential factors such as
illumination, seasonal effects, and complex land covers. In this study, we
highlight the significance of accurately describing changes in remote sensing
images and compare the change captioning task on natural and synthetic images
with that on remote sensing images. To address the challenge of
generating accurate captions, we propose an attentive changes-to-captions
network, called Chg2Cap for short, for bi-temporal remote sensing images. The
network comprises three main components: 1) a Siamese CNN-based feature
extractor to collect high-level representations for each image pair; 2) an
attentive decoder that includes a hierarchical self-attention block to locate
change-related features and a residual block to generate the image embedding;
and 3) a transformer-based caption generator to decode the relationship between
the image embedding and the word embedding into a description. The proposed
Chg2Cap network is evaluated on two representative remote sensing datasets, and
a comprehensive experimental analysis is provided. The code and pre-trained
models will be available online at https://github.com/ShizhenChang/Chg2Cap.
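
To make the three-component design above concrete, here is a minimal PyTorch sketch of a Chg2Cap-style pipeline: a shared (Siamese) CNN encoder for the image pair, a self-attention block with a residual connection that fuses the bi-temporal features into an image embedding, and a transformer decoder that turns that embedding plus word embeddings into caption logits. The toy convolutional backbone, layer sizes, and single attention block are illustrative assumptions rather than the authors' implementation; the actual code is in the repository linked above.

```python
# Minimal sketch of a Chg2Cap-style pipeline (not the authors' code).
import torch
import torch.nn as nn


class SiameseEncoder(nn.Module):
    """Shared-weight CNN that maps each image of a bi-temporal pair to a feature map."""

    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, img_a, img_b):
        # The same weights are applied to both temporal images (Siamese setup).
        return self.backbone(img_a), self.backbone(img_b)   # each (B, D, H, W)


class AttentiveFusion(nn.Module):
    """Self-attention over the concatenated bi-temporal tokens plus a residual
    connection, producing a change-aware image embedding."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_a, feat_b):
        tok_a = feat_a.flatten(2).transpose(1, 2)           # (B, HW, D)
        tok_b = feat_b.flatten(2).transpose(1, 2)
        tokens = torch.cat([tok_a, tok_b], dim=1)           # (B, 2*HW, D)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)               # residual connection
        # Fuse the two temporal halves position-wise into one embedding.
        half_a, half_b = tokens.chunk(2, dim=1)
        return self.proj(torch.cat([half_a, half_b], dim=-1))  # (B, HW, D)


class CaptionGenerator(nn.Module):
    """Transformer decoder that attends to the image embedding while predicting
    the next word token."""

    def __init__(self, vocab_size, dim=256, heads=8, layers=2, max_len=40):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, words, image_embedding):
        t = words.size(1)
        pos = torch.arange(t, device=words.device)
        x = self.word_emb(words) + self.pos_emb(pos)
        # Causal mask so each position only sees earlier words.
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=words.device), diagonal=1
        )
        x = self.decoder(x, image_embedding, tgt_mask=causal)
        return self.head(x)                                  # (B, T, vocab)


if __name__ == "__main__":
    encoder, fusion = SiameseEncoder(), AttentiveFusion()
    captioner = CaptionGenerator(vocab_size=1000)
    img_a = torch.randn(2, 3, 128, 128)      # "before" images
    img_b = torch.randn(2, 3, 128, 128)      # "after" images
    words = torch.randint(0, 1000, (2, 20))  # teacher-forced caption tokens
    logits = captioner(words, fusion(*encoder(img_a, img_b)))
    print(logits.shape)                      # torch.Size([2, 20, 1000])
```

Running the script prints torch.Size([2, 20, 1000]), i.e. one vocabulary distribution per caption position for each image pair.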
Related papers
- Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning [49.24306593078429]
We propose a novel framework for remote sensing image change captioning guided by Key Change Features and Instruction tuning (KCFI).
KCFI includes a ViT encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, and a pixel-level change detection decoder.
To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset.
arXiv Detail & Related papers (2024-09-19T09:33:33Z)
- Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning [49.48946808024608]
We propose BITA, a novel two-stage vision-language pre-training-based approach that bootstraps interactive image-text alignment for remote sensing image captioning.
Specifically, the first stage performs preliminary alignment through image-text contrastive learning (a generic sketch of this type of objective is given after the list below).
In the second stage, the interactive Fourier Transformer connects the frozen image encoder with a large language model.
arXiv Detail & Related papers (2023-12-02T17:32:17Z)
- Explicit Change Relation Learning for Change Detection in VHR Remote Sensing Images [12.228675703851733]
We propose a network architecture NAME for the explicit mining of change relation features.
The change features of change detection should be divided into pre-changed image features, post-changed image features and change relation features.
In terms of F1, IoU, and OA, our network outperforms existing advanced change detection networks.
arXiv Detail & Related papers (2023-11-14T08:47:38Z)
- TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images [64.63004710817239]
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD.
It improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner.
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z)
- VcT: Visual change Transformer for Remote Sensing Image Change Detection [16.778418602705287]
We propose a novel Visual change Transformer (VcT) model for the visual change detection problem.
Top-K reliable tokens are mined from the map and refined using a clustering algorithm.
Extensive experiments on multiple benchmark datasets validate the effectiveness of the proposed VcT model.
arXiv Detail & Related papers (2023-10-17T17:25:31Z)
- Neighborhood Contrastive Transformer for Change Captioning [80.10836469177185]
We propose a neighborhood contrastive transformer to improve the model's perceiving ability for various changes under different scenes.
The proposed method achieves the state-of-the-art performance on three public datasets with different change scenarios.
arXiv Detail & Related papers (2023-03-06T14:39:54Z)
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with a ResNet-101 backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
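
As a generic illustration of the image-text contrastive objective mentioned in the BITA entry above (not the BITA authors' code), the sketch below computes a symmetric InfoNCE-style loss over a batch of paired image and caption embeddings; the embedding size and temperature are arbitrary assumptions.

```python
# Generic image-text contrastive alignment loss (CLIP/InfoNCE style), shown
# only as an illustration of a first-stage alignment objective.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Pull matched image/text pairs together and push mismatched pairs apart.

    image_emb, text_emb: (B, D) embeddings from the two encoders.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarities
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Symmetric cross-entropy over both retrieval directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    img = torch.randn(4, 256)   # stand-in image features
    txt = torch.randn(4, 256)   # stand-in caption features
    print(contrastive_alignment_loss(img, txt).item())
```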
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.