Beyond RNNs: Benchmarking Attention-Based Image Captioning Models
- URL: http://arxiv.org/abs/2502.18734v1
- Date: Wed, 26 Feb 2025 01:05:18 GMT
- Title: Beyond RNNs: Benchmarking Attention-Based Image Captioning Models
- Authors: Hemanth Teja Yanambakkam, Rahul Chinthala
- Abstract summary: This study benchmarks the performance of attention-based image captioning models against RNN-based approaches. We evaluate the effectiveness of Bahdanau attention in enhancing the alignment between image features and generated captions. Our results show that attention-based models outperform RNNs in generating more accurate and semantically rich captions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning is a challenging task at the intersection of computer vision and natural language processing, requiring models to generate meaningful textual descriptions of images. Traditional approaches rely on recurrent neural networks (RNNs), but recent advancements in attention mechanisms have demonstrated significant improvements. This study benchmarks the performance of attention-based image captioning models against RNN-based approaches using the MS-COCO dataset. We evaluate the effectiveness of Bahdanau attention in enhancing the alignment between image features and generated captions. The models are assessed using natural language processing metrics such as BLEU, METEOR, GLEU, and WER. Our results show that attention-based models outperform RNNs in generating more accurate and semantically rich captions, with better alignment to human evaluation. This work provides insights into the impact of attention mechanisms in image captioning and highlights areas for future improvements.
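The listing does not include the authors' code; as a rough illustration of the mechanism being benchmarked, the following is a minimal PyTorch sketch of additive (Bahdanau) attention, which scores each encoded image region against the decoder's hidden state and returns a weighted context vector. All class, parameter, and dimension names are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score each image region against the decoder state."""
    def __init__(self, feature_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.w_feat = nn.Linear(feature_dim, attn_dim)   # project image-region features
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)  # project decoder hidden state
        self.v = nn.Linear(attn_dim, 1)                  # collapse to a scalar score per region

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, num_regions, feature_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(features) + self.w_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)             # attention weights over regions
        context = (alpha * features).sum(dim=1)          # weighted sum of region features
        return context, alpha.squeeze(-1)
```

At decoding time the returned context vector would typically be concatenated with the previous word embedding before the recurrent update. The four metrics named in the abstract are available off the shelf; a hedged example of scoring one generated caption against a reference, assuming NLTK (for BLEU, METEOR, GLEU) and the jiwer package (for WER):

```python
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score   # requires NLTK WordNet data
from nltk.translate.gleu_score import sentence_gleu
from jiwer import wer

reference = "a man riding a wave on top of a surfboard".split()
candidate = "a man rides a wave on a surfboard".split()

print("BLEU  :", sentence_bleu([reference], candidate))
print("METEOR:", meteor_score([reference], candidate))
print("GLEU  :", sentence_gleu([reference], candidate))
print("WER   :", wer(" ".join(reference), " ".join(candidate)))
```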
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
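The summary gives no implementation details for the retriever; below is a minimal sketch, assuming precomputed L2-normalised image embeddings, of fetching the captions of the k most visually similar memory images for use as retrieval context. Every name here is hypothetical.

```python
import numpy as np

def retrieve_captions(query_embedding: np.ndarray,
                      memory_embeddings: np.ndarray,
                      memory_captions: list[str],
                      k: int = 5) -> list[str]:
    """Return the captions of the k memory images most similar to the query.

    query_embedding: (d,) L2-normalised feature of the query image.
    memory_embeddings: (n, d) L2-normalised features of the external memory.
    """
    similarities = memory_embeddings @ query_embedding   # cosine similarity via dot product
    top_k = np.argsort(-similarities)[:k]                # indices of the nearest neighbours
    return [memory_captions[i] for i in top_k]
```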
arXiv Detail & Related papers (2024-05-21T18:02:07Z) - A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation [9.490898534790977]
This paper proposes a deep neural framework for image caption generation using a GRU-based attention mechanism.
Our approach employs multiple pre-trained convolutional neural networks as the encoder to extract features from the image and a GRU-based language model as the decoder to generate sentences.
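As a companion to the attention sketch earlier in this listing, here is a hedged illustration of a single GRU decoding step that conditions on an attended context vector; the layer sizes and names are assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class GRUCaptionDecoderStep(nn.Module):
    """One decoding step: embed the previous word, concatenate the attention
    context, update the GRU state, and predict the next-word distribution."""
    def __init__(self, vocab_size: int, embed_dim: int, feature_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRUCell(embed_dim + feature_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word: torch.Tensor, context: torch.Tensor, hidden: torch.Tensor):
        # prev_word: (batch,) token ids; context: (batch, feature_dim); hidden: (batch, hidden_dim)
        gru_input = torch.cat([self.embed(prev_word), context], dim=1)
        hidden = self.gru(gru_input, hidden)
        logits = self.out(hidden)                # unnormalised next-word scores
        return logits, hidden
```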
arXiv Detail & Related papers (2023-10-11T07:30:01Z) - Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z) - SR-GNN: Spatial Relation-aware Graph Neural Network for Fine-Grained Image Categorization [24.286426387100423]
We propose a method that captures subtle changes by aggregating context-aware features from the most relevant image regions.
Our approach is inspired by recent advances in self-attention and graph neural networks (GNNs).
It outperforms state-of-the-art approaches by a significant margin in recognition accuracy.
arXiv Detail & Related papers (2022-09-05T19:43:15Z) - Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with a ResNet-101 backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - A Deep Neural Framework for Image Caption Generation Using GRU-Based
Attention Mechanism [5.855671062331371]
This study aims to develop a system that uses a pre-trained convolutional neural network (CNN) to extract features from an image, integrates the features with an attention mechanism, and generates captions using a recurrent neural network (RNN).
Experimental results on the MSCOCO dataset show competitive performance against state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-03T09:47:59Z) - Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision-transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
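The summary only states that concepts are predicted through a classification task; below is a minimal sketch of that idea, assuming pooled grid features from a vision transformer and a fixed multi-label concept vocabulary (all names illustrative).

```python
import torch
import torch.nn as nn

class ConceptClassifierHead(nn.Module):
    """Predict a bag of semantic concepts from ViT grid features (multi-label)."""
    def __init__(self, grid_dim: int, num_concepts: int):
        super().__init__()
        self.classifier = nn.Linear(grid_dim, num_concepts)

    def forward(self, grid_features: torch.Tensor) -> torch.Tensor:
        # grid_features: (batch, num_patches, grid_dim)
        pooled = grid_features.mean(dim=1)              # average-pool the patch tokens
        return torch.sigmoid(self.classifier(pooled))   # independent concept probabilities
```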
arXiv Detail & Related papers (2021-12-09T22:05:05Z) - CAGAN: Text-To-Image Generation with Combined Attention GANs [70.3497683558609]
We propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images according to textual descriptions.
The proposed CAGAN uses two attention models: word attention to draw different sub-regions conditioned on related words, and squeeze-and-excitation attention to capture non-linear interactions among channels.
With spectral normalisation to stabilise training, our proposed CAGAN improves the state of the art on the IS and FID on the CUB dataset and the FID on the more challenging COCO dataset.
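The squeeze-and-excitation component mentioned above is a standard channel-attention block; a minimal self-contained sketch follows, with the reduction ratio and names chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel attention: squeeze spatial dims, excite channels with a gating MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        squeezed = x.mean(dim=(2, 3))                       # global average pool per channel
        gate = torch.sigmoid(self.fc2(torch.relu(self.fc1(squeezed))))
        return x * gate.unsqueeze(-1).unsqueeze(-1)         # rescale each channel
```

The learned gating vector rescales channels so that informative feature maps are emphasised before the generator's next stage.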
arXiv Detail & Related papers (2021-04-26T15:46:40Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture that better exploits the semantics available in captions and leverages them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.