A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning
- URL: http://arxiv.org/abs/2409.18467v3
- Date: Sat, 12 Oct 2024 04:15:09 GMT
- Title: A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning
- Authors: Swadhin Das, Raksha Sharma,
- Abstract summary: We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs.
The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels.
We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks.
- Score: 0.15346678870160887
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Remote sensing images are highly valued for their ability to address complex real-world issues such as risk management, security, and meteorology. However, manually captioning these images is challenging and requires specialized knowledge across various domains. This letter presents an approach for automatically describing (captioning) remote sensing images. We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs. The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels. Furthermore, we advance our approach with a comparison-based beam search method to ensure fairness in the search strategy for generating the final caption. We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks. We evaluated our method across three datasets using seven metrics: BLEU-1 to BLEU-4, METEOR, ROUGE-L, and CIDEr. The results demonstrate that our approach significantly outperforms other state-of-the-art encoder-decoder methods.
Related papers
- FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning [0.15346678870160887]
This paper introduces a novel approach that integrates features from two distinct CNN based encoders.
We also propose a weighted averaging technique to combine the outputs of all GRUs in the stacked decoder.
The results demonstrate that our fusion-based approach, along with the enhanced stacked decoder, significantly outperforms both the transformer-based state-of-the-art model and other LSTM-based baselines.
arXiv Detail & Related papers (2025-02-13T12:54:13Z) - Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning [70.98890307376548]
We propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents during training.
Our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning.
arXiv Detail & Related papers (2024-12-31T13:39:08Z) - Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR)
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - Perceptual Image Compression with Cooperative Cross-Modal Side
Information [53.356714177243745]
We propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff.
Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features.
arXiv Detail & Related papers (2023-11-23T08:31:11Z) - Adjacent Context Coordination Network for Salient Object Detection in
Optical Remote Sensing Images [102.75699068451166]
We propose a novel Adjacent Context Coordination Network (ACCoNet) to explore the coordination of adjacent features in an encoder-decoder architecture for optical RSI-SOD.
The proposed ACCoNet outperforms 22 state-of-the-art methods under nine evaluation metrics, and runs up to 81 fps on a single NVIDIA Titan X GPU.
arXiv Detail & Related papers (2022-03-25T14:14:55Z) - Empirical Analysis of Image Caption Generation using Deep Learning [0.0]
We have implemented and experimented with various flavors of multi-modal image captioning networks.
The goal is to analyze the performance of each approach using various evaluation metrics.
arXiv Detail & Related papers (2021-05-14T05:38:13Z) - A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning [32.11006090613004]
We deal with the problem of generating textual captions from optical remote sensing (RS) images using the notion of deep reinforcement learning.
We introduce an Actor Dual-Critic training strategy where a second critic model is deployed in the form of an encoder-decoder RNN.
We observe that the proposed model generates sentences on the test data highly similar to the ground truth and is successful in generating even better captions in many critical cases.
arXiv Detail & Related papers (2020-10-05T13:35:02Z) - DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z) - Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic
Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
In order to further improve the architecture we introduce a weight function which aims to re-balance classes to increase the attention of the networks to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z) - SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text
Recognition [17.191496890376197]
We propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene texts.
The proposed framework is more robust for low-quality text images, and achieves state-of-the-art results on several benchmark datasets.
arXiv Detail & Related papers (2020-05-22T03:02:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.