Empirical Analysis of Image Caption Generation using Deep Learning
- URL: http://arxiv.org/abs/2105.09906v1
- Date: Fri, 14 May 2021 05:38:13 GMT
- Title: Empirical Analysis of Image Caption Generation using Deep Learning
- Authors: Aditya Bhattacharya, Eshwar Shamanna Girishekar, Padmakar Anil Deshpande
- Abstract summary: We have implemented and experimented with various flavors of multi-modal image captioning networks.
The goal is to analyze the performance of each approach using various evaluation metrics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated image captioning is one of the applications of Deep Learning which
involves fusion of work done in computer vision and natural language
processing, and it is typically performed using Encoder-Decoder architectures.
In this project, we have implemented and experimented with various flavors of
multi-modal image captioning networks where ResNet101, DenseNet121 and VGG19
based CNN Encoders and Attention based LSTM Decoders were explored. We have
studied the effect of beam size and the use of pretrained word embeddings and
compared them to baseline CNN encoder and RNN decoder architecture. The goal is
to analyze the performance of each approach using various evaluation metrics
including BLEU, CIDEr, ROUGE and METEOR. We have also explored model
explainability using Visual Attention Maps (VAM) to highlight parts of the
images that contribute most to predicting each word of the generated
caption.
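Since the abstract studies the effect of beam size on caption quality, a minimal pure-Python beam-search sketch may help illustrate the idea. The table-driven `step_probs` function below is a toy stand-in for a trained decoder's next-word distribution; all names and probabilities are illustrative assumptions, not taken from the paper:

```python
import math

def beam_search(step_probs, beam_size, max_len):
    """Decode with beam search: step_probs(prefix) returns a dict
    mapping each candidate next token to its probability."""
    # Each hypothesis is (cumulative log-prob, token list); start from <s>.
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for logp, toks in beams:
            if toks[-1] == "</s>":          # finished captions are carried over
                candidates.append((logp, toks))
                continue
            for tok, p in step_probs(toks).items():
                candidates.append((logp + math.log(p), toks + [tok]))
        # Keep only the beam_size highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]

# Toy "decoder": a fixed conditional distribution over a tiny vocabulary.
TABLE = {
    ("<s>",): {"a": 0.6, "the": 0.4},
    ("<s>", "a"): {"dog": 0.5, "cat": 0.5},
    ("<s>", "the"): {"dog": 0.9, "cat": 0.1},
    ("<s>", "a", "dog"): {"</s>": 1.0},
    ("<s>", "a", "cat"): {"</s>": 1.0},
    ("<s>", "the", "dog"): {"</s>": 1.0},
    ("<s>", "the", "cat"): {"</s>": 1.0},
}

def step_probs(prefix):
    return TABLE[tuple(prefix)]
```

With beam size 1 the search commits to the locally best first word ("a", p=0.6) and misses the globally best caption; with beam size 2 it recovers "the dog" (joint probability 0.36), which is why larger beams can improve BLEU-style scores.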
Related papers
- A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning [0.15346678870160887]
We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs.
The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels.
We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks.
arXiv Detail & Related papers (2024-09-27T06:12:31Z) - Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z) - Compressed Image Captioning using CNN-based Encoder-Decoder Framework [0.0]
We develop an automatic image captioning architecture that combines the strengths of convolutional neural networks (CNNs) and encoder-decoder models.
We also present a performance comparison across pre-trained CNN models.
In our quest for optimization, we also explored the integration of frequency regularization techniques to compress the "AlexNet" and "EfficientNetB0" models.
arXiv Detail & Related papers (2024-04-28T03:47:48Z) - Transforming Visual Scene Graphs to Image Captions [69.13204024990672]
We propose to transform Scene Graphs (TSG) into more descriptive captions.
In TSG, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) for embedding scene graphs.
In TSG, each expert is built on MHA to discriminate among the graph embeddings and generate different kinds of words.
arXiv Detail & Related papers (2023-05-03T15:18:37Z) - Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
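The kNN-memory idea shared by this entry and the earlier retrieval-augmented one can be sketched in a few lines: retrieve the memory entries most similar to a query and attend over them with softmax weights. The `knn_attend` helper below is a hypothetical pure-Python illustration of that pattern, not the authors' implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_attend(query, memory_keys, memory_values, k):
    """Retrieve the k memory entries most similar to the query and
    return a softmax-weighted average of their values."""
    sims = [cosine(query, key) for key in memory_keys]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    weights = [math.exp(sims[i]) for i in top]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(memory_values[0])
    return [sum(w * memory_values[i][d] for w, i in zip(weights, top))
            for d in range(dim)]
```

In the papers above the keys come from visual similarities and the values from an external corpus; here both are plain lists so the retrieval-then-attend step stays self-contained.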
arXiv Detail & Related papers (2022-07-26T19:35:49Z) - Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with a ResNet-101 backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z) - Deep ensembles based on Stochastic Activation Selection for Polyp Segmentation [82.61182037130406]
This work deals with medical image segmentation and in particular with accurate polyp detection and segmentation during colonoscopy examinations.
Basic architecture in image segmentation consists of an encoder and a decoder.
We compare several variants of the DeepLab architecture obtained by varying the decoder backbone.
arXiv Detail & Related papers (2021-04-02T02:07:37Z) - Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation [1.2183405753834562]
We propose to use Inception-ResNet Convolutional Neural Network as encoder to extract features from images.
We also use Hierarchical Context based Word Embeddings for word representations and a Deep Stacked Long Short-Term Memory (LSTM) network as decoder.
We evaluate our proposed methods with two image captioning frameworks: Decoder and Soft Attention.
arXiv Detail & Related papers (2021-02-22T18:15:39Z) - Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model [0.8793721044482612]
This paper presents an end-to-end image captioning system utilizing a multimodal architecture.
Our approach's language encoder captures the fine-grained information in the caption, and combined with the image features, it generates accurate and diversified captions.
arXiv Detail & Related papers (2021-02-14T16:44:17Z) - Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in segmentation quality and in efficiency per layer.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
In order to further improve the architecture we introduce a weight function which aims to re-balance classes to increase the attention of the networks to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z) - Image Segmentation Using Deep Learning: A Survey [58.37211170954998]
Image segmentation is a key topic in image processing and computer vision.
There has been a substantial body of work aimed at developing image segmentation approaches using deep learning models.
arXiv Detail & Related papers (2020-01-15T21:37:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.