Bornon: Bengali Image Captioning with Transformer-based Deep learning
approach
- URL: http://arxiv.org/abs/2109.05218v1
- Date: Sat, 11 Sep 2021 08:29:26 GMT
- Title: Bornon: Bengali Image Captioning with Transformer-based Deep learning
approach
- Authors: Faisal Muhammad Shah, Mayeesha Humaira, Md Abidur Rahman Khan Jim,
Amit Saha Ami and Shimul Paul
- Abstract summary: The Transformer model has been used to generate captions from images using English datasets.
We used three different Bengali datasets to generate Bengali captions from images using the Transformer model.
We compared the result of the transformer-based model with other models that employed different Bengali image captioning datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning using an Encoder-Decoder based approach, where a
CNN is used as the Encoder and a sequence generator such as an RNN as the
Decoder, has proven to be very effective. However, this method has the
drawback that the sequence must be processed in order. To overcome this
drawback, some researchers have utilized the Transformer model to generate
captions from images using English datasets. However, none of them generated
captions in Bengali using the Transformer model. As a result, we utilized
three different Bengali datasets to generate Bengali captions from images
using the Transformer model. Additionally, we compared the performance of the
Transformer-based model with a visual attention-based Encoder-Decoder
approach. Finally, we compared the results of the Transformer-based model with
those of other models that employed different Bengali image captioning
datasets.
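The architecture the abstract describes, with a CNN encoding the image and a Transformer decoder generating the caption token by token, can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed sizes, backbone (ResNet-50), and vocabulary, not the paper's actual configuration; positional encodings are omitted for brevity.

```python
# Minimal sketch (not the paper's exact model): a CNN encodes the image into a
# grid of features and a Transformer decoder attends over that grid while
# generating caption tokens. Positional encodings are omitted for brevity.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionTransformer(nn.Module):
    def __init__(self, vocab_size=8000, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        backbone = models.resnet50(weights=None)  # assumed backbone; any CNN works
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep spatial grid
        self.proj = nn.Linear(2048, d_model)      # map CNN channels to model width
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # images: (B, 3, H, W); tokens: (B, T) caption ids so far
        feats = self.cnn(images)                   # (B, 2048, h, w)
        memory = self.proj(feats.flatten(2).transpose(1, 2))  # (B, h*w, d_model)
        tgt = self.embed(tokens)                   # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=mask)  # causal + cross-attention
        return self.out(hidden)                    # (B, T, vocab_size)

model = CaptionTransformer()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 8000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 8000])
```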
Related papers
- A Simple Text to Video Model via Transformer [4.035107857147382]
We present a general and simple text to video model based on Transformer.
Since both text and video are sequential data, we encode both texts and images into the same hidden space.
We use GPT2 and test our approach on the UCF101 dataset, showing it can generate promising videos.
arXiv Detail & Related papers (2023-09-26T05:26:30Z)
- Cats: Complementary CNN and Transformer Encoders for Segmentation [13.288195115791758]
We propose a model with double encoders for 3D biomedical image segmentation.
We fuse the information from the convolutional encoder and the transformer, and pass it to the decoder to obtain the results.
Compared to the state-of-the-art models with and without transformers on each task, our proposed method obtains higher Dice scores across the board.
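A double-encoder design of this kind can be sketched generically as follows; this is not the Cats implementation, and the 2D (rather than 3D) setting, additive fusion, and all dimensions are illustrative assumptions.

```python
# Generic sketch of complementary CNN + Transformer encoders (not the Cats code).
# 2D instead of 3D, and additive fusion, are simplifying assumptions.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.conv_enc = nn.Sequential(            # local features from a CNN
            nn.Conv2d(1, d_model, 3, stride=4, padding=1), nn.ReLU())
        self.patchify = nn.Conv2d(1, d_model, 4, stride=4)  # image -> patch tokens
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trans_enc = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, x):                          # x: (B, 1, H, W)
        c = self.conv_enc(x)                       # (B, C, H/4, W/4) local context
        t = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N, C) tokens
        t = self.trans_enc(t)                      # global context via self-attention
        t = t.transpose(1, 2).reshape_as(c)        # back to a feature map
        return c + t                               # fused features for the decoder

fused = DualEncoder()(torch.randn(2, 1, 64, 64))
print(fused.shape)  # torch.Size([2, 256, 16, 16])
```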
arXiv Detail & Related papers (2022-08-24T14:25:11Z)
- Towards End-to-End Image Compression and Analysis with Transformers [99.50111380056043]
We propose an end-to-end image compression and analysis model with Transformers, targeting the cloud-based image classification application.
We aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer.
Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.
arXiv Detail & Related papers (2021-12-17T03:28:14Z)
- Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network [0.5260346080244567]
We propose a novel transformer-based architecture with an attention mechanism, using a pre-trained ResNet-101 model as the image encoder for feature extraction.
Experiments demonstrate that the language decoder in our technique captures fine-grained information in the caption and, when paired with image features, produces accurate and diverse captions.
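Feature extraction with a pre-trained ResNet-101 of the kind described here can be sketched with torchvision as below; stripping the classification head and flattening the spatial grid into tokens for a decoder are assumed details, not the paper's exact pipeline.

```python
# Sketch: extracting image features with a pre-trained ResNet-101 (torchvision),
# dropping the classification head so the spatial feature grid is preserved.
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V2)
encoder = nn.Sequential(*list(resnet.children())[:-2])  # remove avgpool + fc
encoder.eval()

with torch.no_grad():
    feats = encoder(torch.randn(1, 3, 224, 224))  # (1, 2048, 7, 7)
    tokens = feats.flatten(2).transpose(1, 2)      # (1, 49, 2048) for a decoder
print(tokens.shape)
```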
arXiv Detail & Related papers (2021-10-24T13:33:23Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
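The bottleneck construction can be sketched as follows; the stand-in frozen encoder, the mean pooling, and all sizes here are assumptions for illustration, not the paper's setup.

```python
# Sketch of the bottleneck idea (not the paper's code): a frozen encoder's token
# states are pooled into one sentence vector, and only the pooling head plus a
# single-layer decoder are trained to reconstruct the sentence (denoising-style).
import torch
import torch.nn as nn

d_model, vocab = 256, 1000
frozen_encoder = nn.TransformerEncoder(              # stand-in for a pretrained LM
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
for p in frozen_encoder.parameters():
    p.requires_grad = False                          # encoder stays frozen

pool = nn.Linear(d_model, d_model)                   # trainable bottleneck head
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=1)  # single-layer decoder
lm_head = nn.Linear(d_model, vocab)
embed = nn.Embedding(vocab, d_model)

tokens = torch.randint(0, vocab, (2, 16))
states = frozen_encoder(embed(tokens))               # (B, T, d_model)
sentence_vec = pool(states.mean(dim=1, keepdim=True))  # (B, 1, d_model) bottleneck
logits = lm_head(decoder(embed(tokens), sentence_vec))  # reconstruct from the vector
print(logits.shape)  # torch.Size([2, 16, 1000])
```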
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- StyTr^2: Unbiased Image Style Transfer with Transformers [59.34108877969477]
The goal of image style transfer is to render an image with artistic features guided by a style reference while maintaining the original content.
Traditional neural style transfer methods are usually biased, and content leakage can be observed by running the style transfer process several times with the same reference image.
We propose a transformer-based approach, namely StyTr2, to address this critical issue.
arXiv Detail & Related papers (2021-05-30T15:57:09Z)
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into its Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
- Transformer in Transformer [59.066686278998354]
We propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representation.
Our TNT achieves 81.3% top-1 accuracy on ImageNet, which is 1.5% higher than that of DeiT with similar computational cost.
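The patch-level/pixel-level split can be sketched as follows; this is an illustrative reduction of the TNT idea with assumed sizes, not the paper's implementation.

```python
# Minimal sketch of the TNT idea: an inner Transformer runs over pixel-level
# tokens inside each patch, and its output is projected and added to the
# patch-level token before the outer Transformer. All sizes are assumptions.
import torch
import torch.nn as nn

B, num_patches, pix_per_patch, d_pix, d_patch = 2, 16, 16, 24, 128
inner = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_pix, nhead=4, batch_first=True), 1)
outer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_patch, nhead=8, batch_first=True), 1)
proj = nn.Linear(pix_per_patch * d_pix, d_patch)  # pixel info -> patch token

pix_tokens = torch.randn(B * num_patches, pix_per_patch, d_pix)
patch_tokens = torch.randn(B, num_patches, d_patch)

pix_out = inner(pix_tokens)                       # pixel-level mixing per patch
patch_tokens = patch_tokens + proj(
    pix_out.reshape(B, num_patches, -1))          # inject pixel-level detail
out = outer(patch_tokens)                         # patch-level mixing
print(out.shape)  # torch.Size([2, 16, 128])
```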
arXiv Detail & Related papers (2021-02-27T03:12:16Z)
- CPTR: Full Transformer Network for Image Captioning [15.869556479220984]
CaPtion TransformeR (CPTR) takes the sequentialized raw images as the input to the Transformer.
Compared to the "CNN+Transformer" design paradigm, our model can model global context at every encoder layer from the beginning.
arXiv Detail & Related papers (2021-01-26T14:29:52Z)
- Image to Bengali Caption Generation Using Deep CNN and Bidirectional Gated Recurrent Unit [0.0]
There is very little notable research on generating image descriptions in the Bengali language.
About 243 million people speak Bengali, and it is the 7th most spoken language in the world.
This paper used an encoder-decoder approach to generate captions.
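A generic encoder-decoder captioning sketch in this spirit is shown below, with a CNN image vector initializing a GRU decoder's hidden state; the paper's exact bidirectional GRU arrangement is not reproduced here, and the backbone and sizes are assumptions.

```python
# Sketch of CNN-encoder / GRU-decoder captioning: the image vector seeds the
# recurrent decoder's hidden state. Backbone and sizes are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

vocab, d_hid = 5000, 512
cnn = models.resnet50(weights=None)
cnn.fc = nn.Linear(cnn.fc.in_features, d_hid)  # image -> initial hidden state
embed = nn.Embedding(vocab, d_hid)
gru = nn.GRU(d_hid, d_hid, batch_first=True)
out = nn.Linear(d_hid, vocab)

images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, vocab, (2, 10))       # teacher-forced caption prefix
h0 = cnn(images).unsqueeze(0)                   # (1, B, d_hid)
hidden, _ = gru(embed(tokens), h0)              # decoding conditioned on the image
logits = out(hidden)
print(logits.shape)  # torch.Size([2, 10, 5000])
```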
arXiv Detail & Related papers (2020-12-22T16:22:02Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
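The segment-aware mechanism in the Segatron entry above amounts to giving each token sentence- and paragraph-index embeddings alongside the usual token-position embedding; a minimal sketch under assumed sizes and a three-level scheme:

```python
# Sketch: segment-aware input embeddings. Each token's representation sums its
# token, position, sentence-index, and paragraph-index embeddings. Sizes and
# the exact indexing scheme are assumptions for illustration.
import torch
import torch.nn as nn

d_model, vocab = 128, 1000
tok_emb = nn.Embedding(vocab, d_model)
pos_emb = nn.Embedding(512, d_model)   # token position within the sequence
sent_emb = nn.Embedding(64, d_model)   # index of the sentence a token is in
para_emb = nn.Embedding(16, d_model)   # index of the paragraph a token is in

tokens = torch.randint(0, vocab, (2, 8))
pos = torch.arange(8).expand(2, 8)
sent_ids = torch.tensor([[0, 0, 0, 1, 1, 2, 2, 2]] * 2)  # which sentence
para_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]] * 2)  # which paragraph

x = tok_emb(tokens) + pos_emb(pos) + sent_emb(sent_ids) + para_emb(para_ids)
print(x.shape)  # torch.Size([2, 8, 128]) -> input to any Transformer LM
```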