FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning
- URL: http://arxiv.org/abs/2502.09282v1
- Date: Thu, 13 Feb 2025 12:54:13 GMT
- Title: FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning
- Authors: Swadhin Das, Raksha Sharma,
- Abstract summary: This paper introduces a novel approach that integrates features from two distinct CNN based encoders.
We also propose a weighted averaging technique to combine the outputs of all GRUs in the stacked decoder.
The results demonstrate that our fusion-based approach, along with the enhanced stacked decoder, significantly outperforms both the transformer-based state-of-the-art model and other LSTM-based baselines.
- Score: 0.15346678870160887
- License:
- Abstract: Remote sensing image captioning aims to generate descriptive text from remote sensing images, typically employing an encoder-decoder framework. In this setup, a convolutional neural network (CNN) extracts feature representations from the input image, which then guide the decoder in a sequence-to-sequence caption generation process. Although much research has focused on refining the decoder, the quality of image representations from the encoder remains crucial for accurate captioning. This paper introduces a novel approach that integrates features from two distinct CNN based encoders, capturing complementary information to enhance caption generation. Additionally, we propose a weighted averaging technique to combine the outputs of all GRUs in the stacked decoder. Furthermore, a comparison-based beam search strategy is incorporated to refine caption selection. The results demonstrate that our fusion-based approach, along with the enhanced stacked decoder, significantly outperforms both the transformer-based state-of-the-art model and other LSTM-based baselines.
Related papers
- Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning [70.98890307376548]
We propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents during training.
Our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning.
arXiv Detail & Related papers (2024-12-31T13:39:08Z) - A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning [0.15346678870160887]
We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs.
The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels.
We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks.
arXiv Detail & Related papers (2024-09-27T06:12:31Z) - Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR)
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - An Image captioning algorithm based on the Hybrid Deep Learning
Technique (CNN+GRU) [0.0]
We present a CNN-GRU encoder decode framework for caption-to-image reconstructor.
It handles the semantic context into consideration as well as the time complexity.
The suggested model outperforms the state-of-the-art LSTM-A5 model for picture captioning in terms of time complexity and accuracy.
arXiv Detail & Related papers (2023-01-06T10:00:06Z) - MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z) - Dynamic Neural Representational Decoders for High-Resolution Semantic
Segmentation [98.05643473345474]
We propose a novel decoder, termed dynamic neural representational decoder (NRD)
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z) - Representation and Correlation Enhanced Encoder-Decoder Framework for
Scene Text Recognition [10.496558786568672]
We propose a Representation and Correlation Enhanced-Decoder Framework(RCEED) to address these deficiencies and break performance bottleneck.
In the encoder module, local visual feature, global context feature, and position information are aligned and fused to generate a small-size comprehensive feature map.
In the decoder module, two methods are utilized to enhance the correlation between scene and text feature space.
arXiv Detail & Related papers (2021-06-13T10:36:56Z) - Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for
Fine-Resolution Remote Sensing Images [6.171417925832851]
We introduce the Swin Transformer as the backbone to fully extract the context information.
We also design a novel decoder named densely connected feature aggregation module (DCFAM) to restore the resolution and generate the segmentation map.
arXiv Detail & Related papers (2021-04-25T11:34:22Z) - A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning [32.11006090613004]
We deal with the problem of generating textual captions from optical remote sensing (RS) images using the notion of deep reinforcement learning.
We introduce an Actor Dual-Critic training strategy where a second critic model is deployed in the form of an encoder-decoder RNN.
We observe that the proposed model generates sentences on the test data highly similar to the ground truth and is successful in generating even better captions in many critical cases.
arXiv Detail & Related papers (2020-10-05T13:35:02Z) - Modeling Lost Information in Lossy Image Compression [72.69327382643549]
Lossy image compression is one of the most commonly used operators for digital images.
We propose a novel invertible framework called Invertible Lossy Compression (ILC) to largely mitigate the information loss problem.
arXiv Detail & Related papers (2020-06-22T04:04:56Z) - Rethinking and Improving Natural Language Generation with Layer-Wise
Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding, where for each decoder layer, together with the representations from the last encoder layer, which serve as a global view, those from other encoder layers are supplemented for a stereoscopic view of the source sequences.
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.