Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach
- URL: http://arxiv.org/abs/2501.08114v1
- Date: Tue, 14 Jan 2025 13:46:03 GMT
- Title: Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach
- Authors: Yuduo Wang, Weikang Yu, Pedram Ghamisi
- Abstract summary: Existing change captioning methods face two key challenges: high computational demands and insufficient detail in object descriptions.
We propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning.
In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder.
- Score: 11.699082207670815
- Abstract: Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multi-stage fusion strategies, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To address these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Unlike typical models that require multi-stage fusion across the transformer encoder and a separate fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, which achieves CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.
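The abstract specifies only that bi-temporal features are fused in a single stage via cosine similarity, so the snippet below is a minimal PyTorch sketch of what such a difference-guided fusion could look like. The module name `DifferenceGuidedFusion`, the (batch, tokens, channels) feature layout, the change-weighting scheme, and the final linear projection are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferenceGuidedFusion(nn.Module):
    """Hypothetical single-stage fusion of bi-temporal encoder features.

    Assumes both encoders output token features of shape (B, N, C); the exact
    SAT-Cap formulation is not given in the abstract.
    """

    def __init__(self, dim: int):
        super().__init__()
        # One projection back to the decoder dimension keeps the fusion single-stage.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_t1: torch.Tensor, feat_t2: torch.Tensor) -> torch.Tensor:
        # Token-wise cosine similarity: close to 1 for unchanged regions,
        # lower where the two acquisition dates disagree.
        sim = F.cosine_similarity(feat_t1, feat_t2, dim=-1, eps=1e-8)  # (B, N)
        change_weight = (1.0 - sim).unsqueeze(-1)                      # (B, N, 1)

        # Emphasise the feature difference in regions flagged as changed.
        diff = change_weight * (feat_t2 - feat_t1)

        # Concatenate both time steps, each enriched with the difference signal,
        # and project once; the result would feed the caption decoder.
        return self.proj(torch.cat([feat_t1 + diff, feat_t2 + diff], dim=-1))


if __name__ == "__main__":
    fusion = DifferenceGuidedFusion(dim=256)
    f1 = torch.randn(2, 196, 256)  # e.g. 14x14 patch tokens from the earlier image
    f2 = torch.randn(2, 196, 256)  # patch tokens from the later image
    print(fusion(f1, f2).shape)    # torch.Size([2, 196, 256])
```

Because the weighting and projection happen in one pass, there is no repeated cross-attention between the encoder and a separate fusion stack, which is the source of the complexity reduction the abstract claims.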
Related papers
- LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection [8.24662649122549]
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query.
Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information.
We propose the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks.
arXiv Detail & Related papers (2025-01-18T14:54:56Z)
- MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption [8.062368743143388]
We introduce a novel video model-based paradigm that requires no dedicated fusion module.
Specifically, we use the off-the-shelf video encoder to simultaneously extract the temporal and spatial features of bi-temporal images.
Our proposed method can obtain better performance compared with other state-of-the-art RSICC methods.
arXiv Detail & Related papers (2024-10-31T14:02:40Z)
- UTSRMorph: A Unified Transformer and Superresolution Network for Unsupervised Medical Image Registration [4.068692674719378]
Complicated image registration is a key issue in medical image analysis.
We propose a novel unsupervised image registration method named the unified Transformer and superresolution (UTSRMorph) network.
arXiv Detail & Related papers (2024-10-27T06:28:43Z)
- A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- AICT: An Adaptive Image Compression Transformer [18.05997169440533]
We propose a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT).
The proposed ICT can capture both global and local contexts from the latent representations.
We leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract a more compact latent representation.
arXiv Detail & Related papers (2023-07-12T11:32:02Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
arXiv Detail & Related papers (2022-04-04T05:53:42Z)
- End-to-End Transformer Based Model for Image Captioning [1.4303104706989949]
The Transformer-based model integrates image captioning into one stage and realizes end-to-end training.
It achieves new state-of-the-art performance of 138.2% (single model) and 141.0% (ensemble of 4 models).
arXiv Detail & Related papers (2022-03-29T08:47:46Z)
- ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos [49.337912335944026]
We formulate the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model with two input modalities: RGB and Depth videos.
To benefit from vision Transformer capabilities, we use two vision Transformer models, for human detection and visual feature representation.
A temporal representation of the human body is obtained using a vision Transformer and an LSTM network.
arXiv Detail & Related papers (2021-08-23T10:48:18Z)
- CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
This work introduces a self-supervised learning approach to the synthesis of new video clips from old ones.
It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z)