Related papers: Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

URL: http://arxiv.org/abs/2407.14032v1
Date: Fri, 19 Jul 2024 05:07:41 GMT
Title: Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance
Authors: Yongshuo Zhu, Lu Li, Keyan Chen, Chenyang Liu, Fugen Zhou, Zhenwei Shi,
Abstract summary: We introduce a novel change captioning (CC) method based on the foundational knowledge and semantic guidance. We validate the proposed method on the LEVIR-CC and LEVIR-CD datasets.
Score: 19.663899648983417
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Remote sensing image change captioning (RSICC) aims to articulate the changes in objects of interest within bi-temporal remote sensing images using natural language. Given the limitations of current RSICC methods in expressing general features across multi-temporal and spatial scenarios, and their deficiency in providing granular, robust, and precise change descriptions, we introduce a novel change captioning (CC) method based on the foundational knowledge and semantic guidance, which we term Semantic-CC. Semantic-CC alleviates the dependency of high-generalization algorithms on extensive annotations by harnessing the latent knowledge of foundation models, and it generates more comprehensive and accurate change descriptions guided by pixel-level semantics from change detection (CD). Specifically, we propose a bi-temporal SAM-based encoder for dual-image feature extraction; a multi-task semantic aggregation neck for facilitating information interaction between heterogeneous tasks; a straightforward multi-scale change detection decoder to provide pixel-level semantic guidance; and a change caption decoder based on the large language model (LLM) to generate change description sentences. Moreover, to ensure the stability of the joint training of CD and CC, we propose a three-stage training strategy that supervises different tasks at various stages. We validate the proposed method on the LEVIR-CC and LEVIR-CD datasets. The experimental results corroborate the complementarity of CD and CC, demonstrating that Semantic-CC can generate more accurate change descriptions and achieve optimal performance across both tasks.

Related papers

Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. We propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set)
arXiv Detail & Related papers (2025-03-07T07:49:31Z)
Semantic-CD: Remote Sensing Image Semantic Change Detection towards Open-vocabulary Setting [19.663899648983417]
Traditional change detection methods often face challenges in generalizing across semantic categories in practical scenarios. We introduce a novel approach called Semantic-CD, specifically designed for semantic change detection in remote sensing images. By utilizing CLIP's extensive vocabulary knowledge, our model enhances its ability to generalize across categories.
arXiv Detail & Related papers (2025-01-12T13:22:11Z)
HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior [62.04939047885834]
We present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed spatial-CLIP Map.
arXiv Detail & Related papers (2024-11-27T15:22:44Z)
Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning [49.24306593078429]
We propose a novel framework for remote sensing image change captioning, guided by Key Change Features and Instruction-tuned (KCFI) KCFI includes a ViTs encoder for extracting bi-temporal remote sensing image features, a key feature perceiver for identifying critical change areas, and a pixel-level change detection decoder. To validate the effectiveness of our approach, we compare it against several state-of-the-art change captioning methods on the LEVIR-CC dataset.
arXiv Detail & Related papers (2024-09-19T09:33:33Z)
Trustworthy Image Semantic Communication with GenAI: Explainablity, Controllability, and Efficiency [59.15544887307901]
Image semantic communication (ISC) has garnered significant attention for its potential to achieve high efficiency in visual content transmission. Existing ISC systems based on joint source-channel coding face challenges in interpretability, operability, and compatibility. We propose a novel trustworthy ISC framework that employs Generative Artificial Intelligence (GenAI) for multiple downstream inference tasks.
arXiv Detail & Related papers (2024-08-07T14:32:36Z)
AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics. We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain. With only a single representative text feature instead of real images, the synthesized images gradually lose diversity. We propose a novel method to find semantic variations of the target text in the CLIP space.
arXiv Detail & Related papers (2023-08-21T08:12:28Z)
Align, Perturb and Decouple: Toward Better Leverage of Difference Information for RSI Change Detection [24.249552791014644]
Change detection is a widely adopted technique in remote sense imagery (RSI) analysis. We propose a series of operations to fully exploit the difference information: Alignment, Perturbation and Decoupling.
arXiv Detail & Related papers (2023-05-30T03:39:53Z)
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP. Our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
arXiv Detail & Related papers (2023-03-21T12:28:21Z)
CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS) CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment. Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation [7.781425222538382]
DiverGAN is a framework to generate diverse, plausible and semantically consistent images according to a natural-language description. DiverGAN adopts two novel word-level attention modules, i.e., a channel-attention module (CAM) and a pixel-attention module (PAM) Conditional Adaptive Instance-Layer Normalization (CAdaILN) is introduced to enable the linguistic cues from the sentence embedding to flexibly manipulate the amount of change in shape and texture.
arXiv Detail & Related papers (2021-11-17T17:59:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.