Bangla Image Caption Generation through CNN-Transformer based
Encoder-Decoder Network
- URL: http://arxiv.org/abs/2110.12442v1
- Date: Sun, 24 Oct 2021 13:33:23 GMT
- Title: Bangla Image Caption Generation through CNN-Transformer based
Encoder-Decoder Network
- Authors: Md Aminul Haque Palash, MD Abdullah Al Nasim, Sourav Saha, Faria
Afrin, Raisa Mallik, Sathishkumar Samiappan
- Abstract summary: We propose a novel transformer-based architecture with an attention mechanism, using a pre-trained ResNet-101 model as the image encoder for feature extraction.
Experiments demonstrate that the language decoder in our technique captures fine-grained information in the caption and, when paired with the image features, produces accurate and diverse captions.
- Score: 0.5260346080244567
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Automatic Image Captioning is the ongoing effort of generating
syntactically well-formed and accurate textual descriptions of an image in
natural language with context. The encoder-decoder structures used throughout
existing Bengali Image Captioning (BIC) research take abstract image feature
vectors as the encoder's input. We propose a novel transformer-based
architecture with an attention mechanism, using a pre-trained ResNet-101 model
as the image encoder for feature extraction. Experiments demonstrate that the
language decoder in our technique captures fine-grained information in the
caption and, when paired with the image features, produces accurate and diverse
captions on the BanglaLekhaImageCaptions dataset. Our approach outperforms all
existing Bengali Image Captioning work and sets a new benchmark by scoring
0.694 on BLEU-1, 0.630 on BLEU-2, 0.582 on BLEU-3, and 0.337 on METEOR.
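A minimal PyTorch sketch of the architecture described in the abstract (a pre-trained ResNet-101 feature extractor feeding a Transformer decoder through cross-attention) is given below. It is an illustration under assumptions, not the authors' released implementation: the torchvision weights API, vocabulary size, embedding width, and layer counts are placeholders.
```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    """Minimal CNN-Transformer captioner: ResNet-101 encoder + Transformer decoder."""

    def __init__(self, vocab_size=8000, d_model=512, nhead=8, num_layers=4, max_len=40):
        super().__init__()
        # Pre-trained ResNet-101, truncated before pooling/classification so it
        # returns a 7x7 grid of 2048-d features for a 224x224 input image.
        resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():      # keep the CNN frozen in this sketch
            p.requires_grad = False
        self.enc_proj = nn.Linear(2048, d_model)  # project CNN features to decoder width

        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) Bangla token ids
        feats = self.backbone(images)                              # (B, 2048, 7, 7)
        memory = self.enc_proj(feats.flatten(2).transpose(1, 2))   # (B, 49, d_model)

        T = captions.size(1)
        pos = torch.arange(T, device=captions.device)
        tgt = self.token_emb(captions) + self.pos_emb(pos)         # (B, T, d_model)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=captions.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)           # cross-attends to image grid
        return self.lm_head(out)                                   # (B, T, vocab_size) logits

# Teacher-forced training would apply cross-entropy between these logits and the
# next-token targets; beam search over the decoder produces captions at test time.
model = CaptionModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 8000, (2, 12)))
```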
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only Text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
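As a rough, text-only illustration of the DPTR idea summarized above (embeddings from the CLIP text encoder standing in for visual features while pre-training a recognition decoder), the sketch below uses the Hugging Face CLIP text model as the source of pseudo visual memory. The character vocabulary, model sizes, and training details are assumptions for illustration, not the paper's recipe.
```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
from transformers import CLIPTokenizer, CLIPTextModel

# Hypothetical character vocabulary for the recognition decoder (not from the paper);
# id 0 doubles as the padding / BOS token in this sketch.
charset = list("abcdefghijklmnopqrstuvwxyz0123456789")
char2id = {c: i + 1 for i, c in enumerate(charset)}

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
char_emb = nn.Embedding(len(charset) + 1, 512)
head = nn.Linear(512, len(charset) + 1)

words = ["street", "coffee"]                   # text-only pre-training samples
with torch.no_grad():
    toks = tokenizer(words, return_tensors="pt", padding=True)
    pseudo_visual = text_encoder(**toks).last_hidden_state   # (B, L, 512) pseudo visual memory

# Teacher-forced character decoding conditioned only on the text embeddings.
targets = pad_sequence([torch.tensor([char2id[c] for c in w]) for w in words],
                       batch_first=True)
dec_in = torch.cat([torch.zeros(len(words), 1, dtype=torch.long),
                    targets[:, :-1]], dim=1)                 # shift right (BOS first)
T = dec_in.size(1)
causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
logits = head(decoder(char_emb(dec_in), pseudo_visual, tgt_mask=causal))
loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets, ignore_index=0)
```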
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
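One compact way to picture the DeCap projection step summarized above is a similarity-weighted combination over a support memory of CLIP text embeddings, which maps the image embedding into the text embedding space while keeping its content. The sketch below is illustrative: the memory size, temperature, and random tensors standing in for CLIP embeddings are placeholders.
```python
import torch
import torch.nn.functional as F

def project_to_text_space(image_emb, support_memory, temperature=0.07):
    """Project a CLIP image embedding into the text embedding space by taking a
    similarity-weighted combination of a support memory of CLIP text embeddings.
    Shapes and the temperature value here are illustrative."""
    image_emb = F.normalize(image_emb, dim=-1)            # (d,)
    support_memory = F.normalize(support_memory, dim=-1)  # (N, d) text embeddings
    weights = F.softmax(support_memory @ image_emb / temperature, dim=0)  # (N,)
    projected = weights @ support_memory                  # (d,) lies in the text span
    return F.normalize(projected, dim=-1)

# Toy usage with random tensors standing in for CLIP embeddings (d = 512).
memory = torch.randn(1000, 512)     # embeddings of captions from a text-only corpus
image = torch.randn(512)            # CLIP image embedding of the query image
text_like = project_to_text_space(image, memory)   # fed to the text-trained decoder
```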
- Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z)
- Zero-Shot Video Captioning with Evolving Pseudo-Tokens [79.16706829968673]
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model.
The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames.
Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge.
arXiv Detail & Related papers (2022-07-22T14:19:31Z)
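The steering signal in the zero-shot video captioning entry above can be illustrated, in a much simplified form, by scoring candidate captions (proposed by the frozen GPT-2) with their average CLIP matching score over sampled frames and keeping the best one. The paper itself optimizes evolving pseudo-tokens rather than re-ranking, so the snippet below only shows how such an average matching score might be computed with the Hugging Face CLIP model.
```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def average_matching_score(candidate, frames):
    """Mean CLIP image-text similarity between one candidate caption and a list
    of PIL video frames, the signal used to steer caption generation."""
    inputs = proc(text=[candidate], images=frames, return_tensors="pt", padding=True)
    img = F.normalize(clip.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(clip.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"]), dim=-1)
    return (img @ txt.t()).mean().item()

# `candidates` would come from the frozen GPT-2; the highest-scoring one is kept:
# best = max(candidates, key=lambda c: average_matching_score(c, frames))
```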
- CoCa: Contrastive Captioners are Image-Text Foundation Models [41.759438751996505]
Contrastive Captioner (CoCa) is a minimalist design to pretrain an image-text encoder-decoder foundation model.
By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead.
CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks.
arXiv Detail & Related papers (2022-05-04T07:01:14Z)
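The CoCa objective summarized above, contrastive alignment plus captioning computed from one shared forward pass, can be sketched as a single combined loss. The temperature and loss weight below are illustrative assumptions, not the paper's settings.
```python
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_ids,
                    temperature=0.07, caption_weight=2.0):
    """Contrastive (image-text alignment) loss plus captioning (next-token
    cross-entropy) loss, both computed from one shared forward pass."""
    # Symmetric InfoNCE over the in-batch image/text pairs.
    img_emb = F.normalize(img_emb, dim=-1)           # (B, d) pooled image embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)           # (B, d) pooled text embeddings
    logits = img_emb @ txt_emb.t() / temperature     # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # Teacher-forced captioning loss from the decoder that attends to image features.
    captioning = F.cross_entropy(caption_logits[:, :-1].transpose(1, 2),  # predict t+1
                                 caption_ids[:, 1:])
    return contrastive + caption_weight * captioning

# Toy shapes: batch of 4, embedding dim 256, captions of length 16 over a 1k vocab.
loss = coca_style_loss(torch.randn(4, 256), torch.randn(4, 256),
                       torch.randn(4, 16, 1000), torch.randint(0, 1000, (4, 16)))
```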
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting the regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Bornon: Bengali Image Captioning with Transformer-based Deep learning approach [0.0]
The Transformer model has been used to generate captions from images using English datasets.
We used three different Bengali datasets to generate Bengali captions from images using the Transformer model.
We compared the result of the transformer-based model with other models that employed different Bengali image captioning datasets.
arXiv Detail & Related papers (2021-09-11T08:29:26Z)
- Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model [0.8793721044482612]
This paper presents an end-to-end image captioning system utilizing a multimodal architecture.
Our approach's language encoder captures the fine-grained information in the caption and, combined with the image features, generates accurate and diversified captions.
arXiv Detail & Related papers (2021-02-14T16:44:17Z)
- Image to Bengali Caption Generation Using Deep CNN and Bidirectional Gated Recurrent Unit [0.0]
There is very little notable research on generating image descriptions in the Bengali language.
About 243 million people speak Bengali, and it is the 7th most spoken language in the world.
This paper used an encoder-decoder approach to generate captions.
arXiv Detail & Related papers (2020-12-22T16:22:02Z)