Comparative evaluation of CNN architectures for Image Caption Generation
- URL: http://arxiv.org/abs/2102.11506v1
- Date: Tue, 23 Feb 2021 05:43:54 GMT
- Title: Comparative evaluation of CNN architectures for Image Caption Generation
- Authors: Sulabh Katiyar, Samir Kumar Borgohain
- Abstract summary: We have evaluated 17 different Convolutional Neural Networks on two popular Image Caption Generation frameworks.
We observe that the model complexity of a Convolutional Neural Network, as measured by its number of parameters, and the model's accuracy on the Object Recognition task do not necessarily correlate with its efficacy at feature extraction for the Image Caption Generation task.
- Score: 1.2183405753834562
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Aided by recent advances in Deep Learning, Image Caption Generation has seen
tremendous progress over the last few years. Most methods use transfer learning
to extract visual information, in the form of image features, with the help of
pre-trained Convolutional Neural Network models followed by transformation of
the visual information using a Caption Generator module to generate the output
sentences. Different methods have used different Convolutional Neural Network
Architectures and, to the best of our knowledge, there is no systematic study
which compares the relative efficacy of different Convolutional Neural Network
architectures for extracting the visual information. In this work, we have
evaluated 17 different Convolutional Neural Networks on two popular Image
Caption Generation frameworks: the first based on the Neural Image Caption (NIC)
generation model and the second based on the Soft-Attention framework. We observe
that the model complexity of a Convolutional Neural Network, as measured by its
number of parameters, and the model's accuracy on the Object Recognition task do
not necessarily correlate with its efficacy at feature extraction for the Image
Caption Generation task.
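One way to check a claim like "parameter count does not necessarily correlate with captioning efficacy" is a rank correlation between the two quantities. The snippet below is a minimal sketch, not the authors' evaluation procedure: the choice of Spearman correlation, the function names, and the four toy data points are all assumptions made for illustration, and none of the numbers come from the paper.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder values, NOT results from the paper:
# parameter counts (in millions) and a caption-quality score for four
# unnamed CNN encoders.
toy_params_millions = [5, 25, 60, 140]
toy_caption_scores = [0.90, 1.00, 0.95, 0.85]

rho = spearman(toy_params_millions, toy_caption_scores)
print(f"Spearman rho on toy data: {rho:.2f}")
```

A value of rho near +1 would mean larger encoders consistently rank higher on the captioning metric, while a value near 0 or below would be consistent with the observation stated in the abstract. With real data, the same function could be applied per framework (NIC and Soft-Attention) and per metric.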
Related papers
- Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z)
- Graph Neural Networks for Learning Equivariant Representations of Neural Networks [55.04145324152541]
We propose to represent neural networks as computational graphs of parameters.
Our approach enables a single model to encode neural computational graphs with diverse architectures.
We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations.
arXiv Detail & Related papers (2024-03-18T18:01:01Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Recursive Neural Programs: Variational Learning of Image Grammars and Part-Whole Hierarchies [1.5990720051907859]
We introduce Recursive Neural Programs (RNPs), the first neural generative model to address the part-whole hierarchy learning problem.
Our results show that RNPs provide an intuitive and explainable way of composing objects and scenes.
arXiv Detail & Related papers (2022-06-16T22:02:06Z)
- Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is end-to-end learned.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferrable to a new task in a sample efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
- Towards Learning a Vocabulary of Visual Concepts and Operators using Deep Neural Networks [0.0]
We analyze the learned feature maps of trained models using MNIST images for achieving more explainable predictions.
We illustrate the idea by generating visual concepts from a Variational Autoencoder trained using MNIST images.
We were able to reduce the reconstruction loss (mean square error) from an initial value of 120 without augmentation to 60 with augmentation.
arXiv Detail & Related papers (2021-09-01T16:34:57Z)
- A Comparison for Anti-noise Robustness of Deep Learning Classification Methods on a Tiny Object Image Dataset: from Convolutional Neural Network to Visual Transformer and Performer [27.023667473278266]
We first briefly review the development of Convolutional Neural Network and Visual Transformer in deep learning.
We then use various models of Convolutional Neural Network and Visual Transformer to conduct a series of experiments on the image dataset of tiny objects.
We discuss the problems in the classification of tiny objects and outline prospects for tiny-object classification in the future.
arXiv Detail & Related papers (2021-06-03T15:28:17Z)
- MOGAN: Morphologic-structure-aware Generative Learning from a Single Image [59.59698650663925]
Recently proposed generative models complete training based on only one image.
We introduce a MOrphologic-structure-aware Generative Adversarial Network named MOGAN that produces random samples with diverse appearances.
Our approach focuses on internal features including the maintenance of rational structures and variation on appearance.
arXiv Detail & Related papers (2021-03-04T12:45:23Z)
- NAS-DIP: Learning Deep Image Prior with Neural Architecture Search [65.79109790446257]
Recent work has shown that the structure of deep convolutional neural networks can be used as a structured image prior.
We propose to search for neural architectures that capture stronger image priors.
We search for an improved network by leveraging an existing neural architecture search algorithm.
arXiv Detail & Related papers (2020-08-26T17:59:36Z)
- Text-to-Image Generation with Attention Based Recurrent Neural Networks [1.2599533416395765]
We develop a tractable and stable caption-based image generation model.
Experiments are performed on Microsoft datasets.
Results show that the proposed model performs better than contemporary approaches.
arXiv Detail & Related papers (2020-01-18T12:19:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.