A Deep Neural Framework for Image Caption Generation Using GRU-Based
Attention Mechanism
- URL: http://arxiv.org/abs/2203.01594v1
- Date: Thu, 3 Mar 2022 09:47:59 GMT
- Title: A Deep Neural Framework for Image Caption Generation Using GRU-Based
Attention Mechanism
- Authors: Rashid Khan, M Shujah Islam, Khadija Kanwal, Mansoor Iqbal, Md. Imran
Hossain, and Zhongfu Ye
- Abstract summary: This study aims to develop a system that uses a pre-trained convolutional neural network (CNN) to extract features from an image, integrates the features with an attention mechanism, and creates captions using a recurrent neural network (RNN).
On the MSCOCO dataset, the experimental results achieve competitive performance against state-of-the-art approaches.
- Score: 5.855671062331371
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Image captioning is a fast-growing research field of computer vision and
natural language processing that involves creating text explanations for
images. This study aims to develop a system that uses a pre-trained
convolutional neural network (CNN) to extract features from an image,
integrates the features with an attention mechanism, and creates captions using
a recurrent neural network (RNN). To encode an image into a feature vector as
graphical attributes, we employed multiple pre-trained convolutional neural
networks. A gated recurrent unit (GRU) language model is then chosen as the
decoder to construct the descriptive sentence. To improve performance, we
combine the Bahdanau attention model with the GRU so that learning can focus
on a specific portion of the image. On the MSCOCO dataset, the
experimental results achieve competitive performance against state-of-the-art
approaches.
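The encoder-attention-decoder pipeline the abstract describes can be sketched in NumPy. This is a minimal illustration of additive (Bahdanau) attention over CNN region features, not the authors' implementation; all dimensions, the random weights, and the function name are illustrative assumptions.

```python
import numpy as np

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive (Bahdanau) attention over CNN region features.

    features: (num_regions, feat_dim) feature vectors from a pre-trained CNN
    hidden:   (hid_dim,) current GRU decoder hidden state
    W1, W2, v: learned projections (random here, for the sketch only)
    Returns the context vector fed to the GRU and the attention weights.
    """
    # score_i = v^T tanh(W1 f_i + W2 h)  -- the additive scoring function
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (num_regions,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over regions
    context = weights @ features                        # (feat_dim,)
    return context, weights

# Illustrative sizes: a 7x7 feature map (49 regions) of 512-d vectors.
rng = np.random.default_rng(0)
num_regions, feat_dim, hid_dim, attn_dim = 49, 512, 256, 128
features = rng.standard_normal((num_regions, feat_dim))
hidden = rng.standard_normal(hid_dim)
W1 = 0.01 * rng.standard_normal((feat_dim, attn_dim))
W2 = 0.01 * rng.standard_normal((hid_dim, attn_dim))
v = rng.standard_normal(attn_dim)

context, weights = bahdanau_attention(features, hidden, W1, W2, v)
print(context.shape)   # (512,)
```

At each decoding step, the context vector would be concatenated with the previous word embedding and fed to the GRU, letting the decoder attend to a different image region per generated word.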
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Image segmentation with traveling waves in an exactly solvable recurrent neural network [71.74150501418039]
We show that a recurrent neural network can effectively divide an image into groups according to a scene's structural characteristics.
We present a precise description of the mechanism underlying object segmentation in this network.
We then demonstrate a simple algorithm for object segmentation that generalizes across inputs ranging from simple geometric objects in grayscale images to natural images.
arXiv Detail & Related papers (2023-11-28T16:46:44Z)
- A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation [9.490898534790977]
This paper proposes a deep neural framework for image caption generation using a GRU-based attention mechanism.
Our approach employs multiple pre-trained convolutional neural networks as the encoder to extract features from the image and a GRU-based language model as the decoder to generate sentences.
arXiv Detail & Related papers (2023-10-11T07:30:01Z)
- Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
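The retriever this entry describes selects corpus entries by visual similarity. A minimal sketch of that kNN step, assuming cosine similarity over feature vectors and a toy in-memory corpus (all names and sizes are hypothetical, not the paper's code):

```python
import numpy as np

def knn_retrieve(query, memory_keys, memory_values, k=3):
    """Return the k memory entries whose keys are most similar to the query.

    query:         (d,) visual feature vector of the input image
    memory_keys:   (n, d) visual features of the external corpus
    memory_values: n cached captions (or token sequences) keyed by the features
    """
    q = query / np.linalg.norm(query)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = keys @ q                      # cosine similarity to every entry
    top = np.argsort(-sims)[:k]          # indices of the k nearest entries
    return [(memory_values[i], float(sims[i])) for i in top]

rng = np.random.default_rng(1)
memory_keys = rng.standard_normal((100, 64))
memory_values = [f"caption_{i}" for i in range(100)]
query = memory_keys[42] + 0.01 * rng.standard_normal(64)  # near entry 42
results = knn_retrieve(query, memory_keys, memory_values, k=3)
print(results[0][0])   # the nearest neighbor is entry 42's caption
```

In the paper's architecture, the retrieved captions would then condition generation through a kNN-augmented attention layer rather than being returned directly.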
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Recursive Neural Programs: Variational Learning of Image Grammars and Part-Whole Hierarchies [1.5990720051907859]
We introduce Recursive Neural Programs (RNPs), the first neural generative model to address the part-whole hierarchy learning problem.
Our results show that RNPs provide an intuitive and explainable way of composing objects and scenes.
arXiv Detail & Related papers (2022-06-16T22:02:06Z)
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using the Faster R-CNN with ResNet-101 as a backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
- Controlled Caption Generation for Images Through Adversarial Attacks [85.66266989600572]
We study adversarial examples for vision-and-language models, which typically adopt a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation.
In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network.
We propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN.
arXiv Detail & Related papers (2021-07-07T07:22:41Z)
- Comparative evaluation of CNN architectures for Image Caption Generation [1.2183405753834562]
We have evaluated 17 different Convolutional Neural Networks on two popular Image Caption Generation frameworks.
We observe that the model complexity of a Convolutional Neural Network, as measured by the number of parameters, and its accuracy on the Object Recognition task do not necessarily correlate with its efficacy at feature extraction for the Image Caption Generation task.
arXiv Detail & Related papers (2021-02-23T05:43:54Z)
- NAS-DIP: Learning Deep Image Prior with Neural Architecture Search [65.79109790446257]
Recent work has shown that the structure of deep convolutional neural networks can be used as a structured image prior.
We propose to search for neural architectures that capture stronger image priors.
We search for an improved network by leveraging an existing neural architecture search algorithm.
arXiv Detail & Related papers (2020-08-26T17:59:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.