Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model
- URL: http://arxiv.org/abs/2102.07192v1
- Date: Sun, 14 Feb 2021 16:44:17 GMT
- Title: Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model
- Authors: Mohammad Faiyaz Khan, S.M. Sadiq-Ur-Rahman Shifath, and Md. Saiful Islam
- Abstract summary: This paper presents an end-to-end image captioning system utilizing a multimodal architecture.
Our approach's language encoder captures the fine-grained information in the caption and, combined with the image features, generates accurate and diversified captions.
- Score: 0.8793721044482612
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Image Captioning is an arduous task of producing syntactically and
semantically correct textual descriptions of an image in natural language with
context related to the image. Existing notable pieces of research in Bengali
Image Captioning (BIC) are based on encoder-decoder architecture. This paper
presents an end-to-end image captioning system utilizing a multimodal
architecture by combining a one-dimensional convolutional neural network (CNN)
to encode sequence information with a pre-trained ResNet-50 model image encoder
for extracting region-based visual features. We investigate our approach's
performance on the BanglaLekhaImageCaptions dataset using the existing
evaluation metrics and perform a human evaluation for qualitative analysis.
Experiments show that our approach's language encoder captures the fine-grained
information in the caption and, combined with the image features, generates
accurate and diversified captions. Our work outperforms all the existing BIC
works and achieves a new state-of-the-art (SOTA) performance by scoring 0.651
on BLEU-1, 0.572 on CIDEr, 0.297 on METEOR, 0.434 on ROUGE, and 0.357 on SPICE.
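To make the architecture concrete, here is a minimal PyTorch sketch of the multimodal idea the abstract describes: a frozen pre-trained ResNet-50 encodes the image, a one-dimensional CNN encodes the partial caption, and the fused representation predicts the next word. All layer sizes and names are illustrative assumptions, not the paper's implementation (which uses region-based features rather than the pooled features below).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultimodalCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, feat_dim=2048):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        # Drop the classification head; keep the pooled visual features.
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.image_encoder.parameters():
            p.requires_grad = False  # frozen pre-trained image encoder
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # A 1D convolution over the token sequence captures local
        # n-gram patterns in the partial caption.
        self.text_encoder = nn.Sequential(
            nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.fuse = nn.Linear(feat_dim + embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, image, tokens):
        v = self.image_encoder(image).flatten(1)             # (B, 2048)
        t = self.embed(tokens).transpose(1, 2)               # (B, E, L)
        t = self.text_encoder(t).squeeze(-1)                 # (B, E)
        h = torch.relu(self.fuse(torch.cat([v, t], dim=1)))  # (B, E)
        return self.out(h)                                   # next-word logits
```

At inference, a caption would be generated word by word: feed the tokens emitted so far, take the argmax (or sample), append, and repeat until an end token.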
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Compressed Image Captioning using CNN-based Encoder-Decoder Framework [0.0]
We develop an automatic image captioning architecture that combines the strengths of convolutional neural networks (CNNs) and encoder-decoder models.
We also present a performance comparison across pre-trained CNN models.
In our quest for optimization, we also explored the integration of frequency regularization techniques to compress the "AlexNet" and "EfficientNetB0" models.
arXiv Detail & Related papers (2024-04-28T03:47:48Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z)
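Discrete-diffusion decoding can be pictured as an iterative unmasking loop. The sketch below is a generic mask-predict reading of absorbing-state diffusion, not DDCap's actual code; `model`, `mask_id`, and the reveal schedule are assumptions.

```python
import torch

def diffusion_decode(model, image_feats, seq_len, mask_id, steps=10):
    """Start from an all-[MASK] caption and iteratively reveal the most
    confident predictions (a rough analogue of best-first inference)."""
    tokens = torch.full((1, seq_len), mask_id)       # x_T: fully masked
    for step in range(steps):
        still_masked = tokens.eq(mask_id)
        if not still_masked.any():
            break
        logits = model(image_feats, tokens)          # (1, L, V), assumed API
        probs, preds = logits.softmax(-1).max(-1)    # confidence + argmax
        # Reveal a fraction of the remaining masked positions each step.
        n_reveal = max(1, int(still_masked.sum()) // (steps - step))
        conf = probs.masked_fill(~still_masked, -1.0)
        reveal = conf.topk(n_reveal, dim=1).indices
        tokens.scatter_(1, reveal, preds.gather(1, reveal))
    return tokens
```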
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
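Both retrieval-augmented entries above share one core mechanism: look up captions of visually similar images in an external corpus, then let the decoder attend over the retrieved tokens. A toy sketch under assumed tensor shapes, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def knn_retrieve(query, memory_keys, memory_values, k=5):
    """query: (D,) image embedding; memory_keys: (N, D) corpus image
    embeddings; memory_values: (N, L, D) embeddings of their captions."""
    sims = F.cosine_similarity(query.unsqueeze(0), memory_keys, dim=1)  # (N,)
    idx = sims.topk(k).indices
    return memory_values[idx].flatten(0, 1)          # (k*L, D) retrieved tokens

def knn_augmented_attention(decoder_state, retrieved):
    """Single-query attention of one decoder state over retrieved tokens."""
    d = decoder_state.shape[-1]
    scores = retrieved @ decoder_state / d ** 0.5    # (k*L,) similarity scores
    return scores.softmax(dim=0) @ retrieved         # (D,) context vector
```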
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
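The "image as a token sequence" step can be illustrated with plain vector quantization: snap each patch embedding to its nearest codebook entry and keep the index. A toy sketch (the real ViT-VQGAN tokenizer is a trained encoder-decoder, not this lookup alone):

```python
import torch

def quantize(patch_embeddings, codebook):
    """patch_embeddings: (P, D) ViT patch features; codebook: (K, D)
    learned code vectors. Returns (P,) discrete token ids."""
    dists = torch.cdist(patch_embeddings, codebook)  # (P, K) pairwise distances
    return dists.argmin(dim=1)                       # nearest-code index per patch
```

The resulting id sequence is what the sequence-to-sequence Transformer learns to predict from the text prompt.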
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with ResNet-101 as a backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
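The region-attention step in decoders like the one above can be sketched as standard soft attention over per-region features; this is a generic additive formulation, not the paper's reflective module.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Soft attention over per-region features, conditioned on the
    decoder's hidden state (generic additive/Bahdanau attention)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):         # (B, R, F), (B, H)
        e = torch.tanh(self.w_feat(regions) + self.w_hidden(hidden).unsqueeze(1))
        alpha = self.score(e).softmax(dim=1)    # (B, R, 1) region weights
        return (alpha * regions).sum(dim=1)     # (B, F) attended context
```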
- End-to-End Transformer Based Model for Image Captioning [1.4303104706989949]
The Transformer-based model integrates image captioning into one stage and realizes end-to-end training.
The model achieves new state-of-the-art CIDEr scores of 138.2% (single model) and 141.0% (ensemble of 4 models).
arXiv Detail & Related papers (2022-03-29T08:47:46Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
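One way to picture the concept-token idea: pool the ViT grid features and train a multi-label classifier over a concept vocabulary mined from captions. The head below is an assumed simplification of that recipe, not ViTCAP's exact CTN.

```python
import torch.nn as nn

class ConceptHead(nn.Module):
    """Multi-label concept classifier over pooled ViT grid features."""
    def __init__(self, d_model=768, num_concepts=1000):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_concepts)

    def forward(self, grid_features):        # (B, N, D) patch features
        pooled = grid_features.mean(dim=1)   # (B, D) average pool
        return self.classifier(pooled)       # (B, num_concepts) logits

# Training would use a multi-label loss such as BCEWithLogitsLoss; at
# inference the top-k scoring concepts become extra tokens for the decoder.
```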
- Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network [0.5260346080244567]
We propose a novel transformer-based architecture with an attention mechanism, using a pre-trained ResNet-101 image encoder for feature extraction.
Experiments demonstrate that the language decoder in our technique captures fine-grained information in the caption and, paired with image features, produces accurate and diverse captions.
arXiv Detail & Related papers (2021-10-24T13:33:23Z)
- Empirical Analysis of Image Caption Generation using Deep Learning [0.0]
We have implemented and experimented with various flavors of multi-modal image captioning networks.
The goal is to analyze the performance of each approach using various evaluation metrics.
arXiv Detail & Related papers (2021-05-14T05:38:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.