Comparative study of Transformer and LSTM Network with attention
mechanism on Image Captioning
- URL: http://arxiv.org/abs/2303.02648v1
- Date: Sun, 5 Mar 2023 11:45:53 GMT
- Title: Comparative study of Transformer and LSTM Network with attention
mechanism on Image Captioning
- Authors: Pranav Dandwate, Chaitanya Shahane, Vandana Jagtap, Shridevi C.
Karande
- Abstract summary: This study compares a Transformer model and an LSTM-with-attention-block model on the MS-COCO dataset.
Alongside these, the CLIP-diffusion, M2-Transformer, and X-Linear Attention models, which report state-of-the-art accuracy, are also discussed.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In today's globalized world, at the present epoch of generative
intelligence, many manual tasks are automated with increased efficiency, which
helps businesses save time and money. A crucial component of generative
intelligence is the integration of vision and language; consequently, image
captioning has become an intriguing area of research. Researchers have made
multiple attempts to solve this problem with different deep learning
architectures, and although accuracy has increased, the results are still not
up to standard. This study focuses on comparing a Transformer model and an
LSTM-with-attention-block model on MS-COCO, a standard dataset for image
captioning. For both models, a pretrained Inception-V3 CNN encoder is used to
extract image features. The Bilingual Evaluation Understudy (BLEU) score is
used to assess the accuracy of the captions generated by both models.
Alongside the Transformer and LSTM-with-attention-block models, the
CLIP-diffusion, M2-Transformer, and X-Linear Attention models, which report
state-of-the-art accuracy, are also discussed.
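For concreteness, the pipeline described in the abstract can be sketched as follows: image features are extracted with a pretrained Inception-V3 encoder, and a generated caption is scored against reference captions with BLEU. This is a minimal illustration assuming TensorFlow/Keras and NLTK, not the authors' code; the function names, pooling choices, and example captions below are illustrative only.

```python
# Minimal sketch (not the paper's implementation): Inception-V3 feature
# extraction and BLEU scoring, assuming TensorFlow/Keras and NLTK.
import numpy as np
import tensorflow as tf
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Pretrained Inception-V3 without the classification head, used as a frozen
# CNN encoder that yields spatial image features for the caption decoder.
encoder = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

def extract_features(image_path: str) -> np.ndarray:
    """Return the Inception-V3 feature map for one image
    (roughly (8, 8, 2048) at a 299x299 input)."""
    img = tf.keras.utils.load_img(image_path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    features = encoder(tf.expand_dims(x, axis=0))
    return features.numpy().squeeze()

# BLEU between a generated caption and reference captions; the tokens here
# are toy examples, not drawn from MS-COCO.
references = [["a", "dog", "runs", "on", "the", "beach"],
              ["a", "dog", "is", "running", "along", "the", "beach"]]
candidate = ["a", "dog", "running", "on", "the", "beach"]
bleu = sentence_bleu(references, candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")
```

In the paper's setting, the extracted feature map would feed the attention block of either decoder (LSTM or Transformer); here only the encoder and evaluation steps are shown.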
Related papers
- Efficient Machine Translation with a BiLSTM-Attention Approach [0.0]
This paper proposes a novel Seq2Seq model aimed at improving translation quality while reducing the storage space required by the model.
The model employs a Bidirectional Long Short-Term Memory network (Bi-LSTM) as the encoder to capture the context information of the input sequence.
Compared to the current mainstream Transformer model, our model achieves superior performance on the WMT14 machine translation dataset.
arXiv Detail & Related papers (2024-10-29T01:12:50Z)
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding [47.97650346560239]
We propose the Spatial-Temporal Auxiliary Network with Mutual-guided alignment module (Mug-STAN) to extend image-text models to diverse video tasks and video-text data.
Mug-STAN significantly improves adaptation of language-image pretrained models such as CLIP and CoCa at both video-text post-pretraining and finetuning stages.
arXiv Detail & Related papers (2023-11-25T17:01:38Z)
- Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval [17.70430913227593]
We introduce a novel unlabeled and pre-trained masked tuning approach to reduce the gap between the pre-trained model and the downstream CIR task.
With such a simple design, it can learn to capture fine-grained text-guided modifications.
arXiv Detail & Related papers (2023-11-13T02:49:57Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition using a Novel Transformers-based Model and an Innovative 270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics [0.0]
This research is the second phase in a series of investigations on developing an Optical Character Recognition (OCR) of Arabic historical documents.
We propose an end-to-end text recognition approach using Vision Transformers as an encoder, namely BEIT, and vanilla Transformer as a decoder, eliminating CNNs for feature extraction and reducing the model's complexity.
arXiv Detail & Related papers (2022-08-20T22:21:19Z) - Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE)
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)