Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image
Captioning
- URL: http://arxiv.org/abs/2312.01191v1
- Date: Sat, 2 Dec 2023 17:32:17 GMT
- Title: Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image
Captioning
- Authors: Cong Yang, Zuchao Li, Lefei Zhang
- Abstract summary: We propose a novel two-stage vision-language pre-training-based approach to bootstrap interactive image-text alignment for remote sensing image captioning, called BITA.
Specifically, the first stage involves preliminary alignment through image-text contrastive learning.
In the second stage, the interactive Fourier Transformer connects the frozen image encoder with a large language model.
- Score: 49.48946808024608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, remote sensing image captioning has gained significant attention in
the remote sensing community. Because remote sensing images differ widely in
spatial resolution, existing methods in this field have predominantly
concentrated on fine-grained extraction of remote sensing image features, but
they fail to preserve semantic consistency between visual and textual features.
To efficiently align image and text, we propose a novel two-stage
vision-language pre-training-based
approach to bootstrap interactive image-text alignment for remote sensing image
captioning, called BITA, which relies on the design of a lightweight
interactive Fourier Transformer to better align remote sensing image-text
features. The Fourier layer in the interactive Fourier Transformer is capable
of extracting multi-scale features of remote sensing images in the frequency
domain, thereby reducing the redundancy of remote sensing visual features.
Specifically, the first stage involves preliminary alignment through image-text
contrastive learning, which aligns the learned multi-scale remote sensing
features from the interactive Fourier Transformer with textual features. In the
second stage, the interactive Fourier Transformer connects the frozen image
encoder with a large language model. Then, prefix causal language modeling is
utilized to guide the text generation process using visual features.
Experimental results on the UCM-caption, RSICD, and NWPU-caption datasets
demonstrate that BITA outperforms other advanced comparison approaches. The
code is available at
https://github.com/yangcong356/BITA.
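Since the abstract describes the method only at a high level, the following is a minimal PyTorch sketch of the three ingredients it names, under stated assumptions: the Fourier layer is assumed to be an FNet-style token-mixing layer (a 2D FFT over visual tokens, keeping the real part), stage-1 alignment is assumed to use a standard symmetric InfoNCE image-text contrastive loss, and stage-2 prefix causal language modeling is assumed to prepend the visual tokens as a prefix and compute the language-modeling loss only on text positions. All class and function names here are illustrative and are not taken from the BITA codebase.

```python
# Minimal sketch under the assumptions stated above; PyTorch, not the official BITA code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FourierMixingLayer(nn.Module):
    """Assumed FNet-style Fourier layer: global token mixing in the frequency domain."""

    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_visual_tokens, dim). A 2D FFT over the token and feature
        # axes mixes all tokens globally without extra parameters; keeping only
        # the real part returns to the real-valued feature space.
        mixed = torch.fft.fft2(x.float(), dim=(-2, -1)).real.to(x.dtype)
        x = self.norm1(x + mixed)
        return self.norm2(x + self.ffn(x))


def image_text_contrastive_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Stage-1 sketch: symmetric InfoNCE loss over matched image-text pairs in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def prefix_causal_lm_loss(visual_prefix: torch.Tensor,
                          text_ids: torch.Tensor,
                          llm,                              # assumed HuggingFace-style causal LM
                          embed_tokens: nn.Embedding,
                          pad_id: int = 0) -> torch.Tensor:
    """Stage-2 sketch: visual tokens act as a prefix; the loss covers text tokens only."""
    text_emb = embed_tokens(text_ids)                       # (batch, T, dim)
    inputs = torch.cat([visual_prefix, text_emb], dim=1)    # (batch, P + T, dim)
    logits = llm(inputs_embeds=inputs).logits               # (batch, P + T, vocab)
    prefix_len = visual_prefix.size(1)
    # The token at text position t is predicted from everything up to position
    # prefix_len + t - 1, so shift the logits accordingly and ignore padding.
    shifted = logits[:, prefix_len - 1:-1, :]
    return F.cross_entropy(shifted.reshape(-1, shifted.size(-1)),
                           text_ids.reshape(-1),
                           ignore_index=pad_id)
```

In this reading, stacking a few FourierMixingLayer blocks over the output of the frozen image encoder would play the role the abstract assigns to the interactive Fourier Transformer: compressing and mixing multi-scale visual tokens before they are aligned with text in stage 1 and passed as a prefix to the frozen large language model in stage 2.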
Related papers
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z)
- Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval [37.775529830620016]
Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain.
Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at distinct scales separately.
We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation.
arXiv Detail & Related papers (2024-05-29T10:19:11Z)
- Large Language Models for Captioning and Retrieving Remote Sensing Images [4.499596985198142]
RS-CapRet is a Vision and Language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z)
- Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment [61.769441954135246]
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
arXiv Detail & Related papers (2023-12-12T03:39:07Z)
- TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images [64.63004710817239]
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image change detection (CD).
It improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner.
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z)
- Changes to Captions: An Attentive Network for Remote Sensing Change Captioning [15.986576036345333]
This study highlights the significance of accurately describing changes in remote sensing images.
We propose an attentive changes-to-captions network, called Chg2Cap for short, for bi-temporal remote sensing images.
The proposed Chg2Cap network is evaluated on two representative remote sensing datasets.
arXiv Detail & Related papers (2023-04-03T15:51:42Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic, text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.