Image to Bengali Caption Generation Using Deep CNN and Bidirectional
Gated Recurrent Unit
- URL: http://arxiv.org/abs/2012.12139v1
- Date: Tue, 22 Dec 2020 16:22:02 GMT
- Title: Image to Bengali Caption Generation Using Deep CNN and Bidirectional
Gated Recurrent Unit
- Authors: Al Momin Faruk, Hasan Al Faraby, Md. Muzahidul Azad, Md. Riduyan
Fedous, Md. Kishor Morol
- Abstract summary: There is very little notable research on generating image descriptions in the Bengali language.
About 243 million people speak Bengali, and it is the 7th most spoken language on the planet.
This paper used an encoder-decoder approach to generate captions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is very little notable research on generating image
descriptions in the Bengali language. About 243 million people speak Bengali,
and it is the 7th most spoken language on the planet. The purpose of this
research is to propose a CNN and Bidirectional GRU based architecture that
generates natural language captions in the Bengali language from an image.
Bengali people can use this research to break the language barrier and better
understand each other's perspectives. It will also help many blind people with
their everyday lives. This paper used an encoder-decoder approach to generate
captions. We used a pre-trained deep convolutional neural network (DCNN)
called InceptionV3 as the image-embedding encoder for analysis,
classification, and annotation of the dataset's images, and a Bidirectional
Gated Recurrent Unit (BGRU) layer as the decoder to generate captions. Argmax
and beam search are used to produce the highest-quality captions. A new
dataset called BNATURE is used, which comprises 8,000 images with five
captions per image. It is used for training and testing the proposed model. We
obtained BLEU-1, BLEU-2, BLEU-3, BLEU-4, and METEOR scores of 42.6, 27.95,
23.66, 16.41, and 28.7, respectively.
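As a rough illustration of the pipeline described above, the sketch below wires a frozen InceptionV3 encoder to a Bidirectional GRU decoder in a Keras-style merge architecture. The paper does not publish code, so the vocabulary size, caption length, and layer widths here are illustrative assumptions, not reported values.

```python
# Hedged sketch of the CNN + Bidirectional GRU captioner described above.
# All hyperparameters are illustrative assumptions, not the paper's values.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

VOCAB_SIZE = 8000  # assumed Bengali vocabulary size
MAX_LEN = 20       # assumed maximum caption length in tokens
EMBED_DIM = 256    # assumed embedding width

# Encoder: pre-trained InceptionV3 used as a frozen image-embedding model.
cnn = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
cnn.trainable = False
image_in = layers.Input(shape=(299, 299, 3))
img_feat = layers.Dense(EMBED_DIM, activation="relu")(cnn(image_in))

# Decoder: word embeddings for the partial caption, fed through a BGRU layer.
caption_in = layers.Input(shape=(MAX_LEN,))
word_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
seq_feat = layers.Bidirectional(layers.GRU(128))(word_emb)  # 2 x 128 = 256

# Merge image and caption features, then predict the next word.
merged = layers.add([img_feat, seq_feat])
hidden = layers.Dense(256, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model(inputs=[image_in, caption_in], outputs=next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```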
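The two decoding strategies named in the abstract can be sketched as follows; the `START`/`END` token ids and the calling convention are assumptions layered on the hypothetical model above, not the authors' code.

```python
# Illustrative argmax (greedy) and beam-search decoding for the model above.
import numpy as np

START, END = 1, 2  # assumed special-token ids; 0 is reserved for padding

def greedy_decode(model, image, max_len=20):
    """Argmax decoding: always append the single most probable next word."""
    seq = [START]
    for _ in range(max_len - 1):
        padded = np.pad(seq, (0, max_len - len(seq)))[None, :]
        probs = model.predict([image[None, ...], padded], verbose=0)[0]
        word = int(np.argmax(probs))
        seq.append(word)
        if word == END:
            break
    return seq

def beam_decode(model, image, k=3, max_len=20):
    """Beam search: keep the k best partial captions by total log-probability."""
    beams = [([START], 0.0)]
    for _ in range(max_len - 1):
        candidates = []
        for seq, score in beams:
            if seq[-1] == END:
                candidates.append((seq, score))  # finished captions carry over
                continue
            padded = np.pad(seq, (0, max_len - len(seq)))[None, :]
            probs = model.predict([image[None, ...], padded], verbose=0)[0]
            for word in np.argsort(probs)[-k:]:  # k most probable next words
                candidates.append((seq + [int(word)],
                                   score + float(np.log(probs[word] + 1e-12))))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams[0][0]
```

Beam search trades extra forward passes for typically higher-quality captions than greedy decoding, which matches the abstract's use of both strategies.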
Related papers
- Indonesian Text-to-Image Synthesis with Sentence-BERT and FastGAN [0.0]
We use Sentence BERT as the text encoder and FastGAN as the image generator.
We translate the CUB dataset into Bahasa Indonesia using Google Translate and manual human translation.
FastGAN uses several skip-excitation modules and an auto-encoder to generate images at a resolution of 512x512x3, twice as large as the current state-of-the-art model.
arXiv Detail & Related papers (2023-03-25T16:54:22Z) - Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z) - Bangla Image Caption Generation through CNN-Transformer based
Encoder-Decoder Network [0.5260346080244567]
We propose a novel transformer-based architecture with an attention mechanism with a pre-trained ResNet-101 model image encoder for feature extraction from images.
Experiments demonstrate that the language decoder in our technique captures fine-grained information in the caption and, paired with image features, produces accurate and diverse captions.
arXiv Detail & Related papers (2021-10-24T13:33:23Z) - Bornon: Bengali Image Captioning with Transformer-based Deep learning
approach [0.0]
The Transformer model has been used to generate captions from images using English datasets.
We used three different Bengali datasets to generate Bengali captions from images using the Transformer model.
We compared the result of the transformer-based model with other models that employed different Bengali image captioning datasets.
arXiv Detail & Related papers (2021-09-11T08:29:26Z) - Controlled Caption Generation for Images Through Adversarial Attacks [85.66266989600572]
We study adversarial examples for vision and language models, which typically adopt a Convolutional Neural Network (i.e., CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation.
In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network.
We propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN.
arXiv Detail & Related papers (2021-07-07T07:22:41Z) - Unsupervised Transfer Learning in Multilingual Neural Machine
Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z) - Read Like Humans: Autonomous, Bidirectional and Iterative Language
Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z) - Improved Bengali Image Captioning via deep convolutional neural network
based encoder-decoder model [0.8793721044482612]
This paper presents an end-to-end image captioning system utilizing a multimodal architecture.
Our approach's language encoder captures the fine-grained information in the caption, and combined with the image features, it generates accurate and diversified captions.
arXiv Detail & Related papers (2021-02-14T16:44:17Z) - Structural and Functional Decomposition for Personality Image Captioning
in a Communication Game [53.74847926974122]
Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait.
We introduce a novel formulation for PIC based on a communication game between a speaker and a listener.
arXiv Detail & Related papers (2020-11-17T10:19:27Z) - Efficient Urdu Caption Generation using Attention based LSTM [0.0]
Urdu is the national language of Pakistan and is also widely spoken and understood in the subcontinent region of Pakistan and India.
We develop an attention-based deep learning model using techniques of sequence modeling specialized for the Urdu language.
We evaluate our proposed technique on this dataset and show that it can achieve a BLEU score of 0.83 in the Urdu language.
arXiv Detail & Related papers (2020-08-02T17:22:33Z) - Transform and Tell: Entity-Aware News Image Captioning [77.4898875082832]
We propose an end-to-end model which generates captions for images embedded in news articles.
We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism.
We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts.
arXiv Detail & Related papers (2020-04-17T05:44:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.