An Ensemble Model with Attention Based Mechanism for Image Captioning
- URL: http://arxiv.org/abs/2501.14828v1
- Date: Wed, 22 Jan 2025 12:28:37 GMT
- Title: An Ensemble Model with Attention Based Mechanism for Image Captioning
- Authors: Israa Al Badarneh, Bassam Hammo, Omar Al-Kadi
- Abstract summary: In this paper, we examine transformer models, emphasizing the critical role that attention mechanisms play.
The proposed model uses a transformer encoder-decoder architecture to create textual captions and a deep learning convolutional neural network to extract features from the images.
For caption generation, we present a novel ensemble learning framework that improves the richness of the generated captions.
- Score: 1.249418440326334
- Abstract: Image captioning produces informative text from an input image by establishing a relationship between the words and the actual content of the image. Recently, deep learning models that utilize transformers have been the most successful in automatically generating image captions. The capabilities of transformer networks have led to notable progress in several vision-related tasks. In this paper, we thoroughly examine transformer models, emphasizing the critical role that attention mechanisms play. The proposed model uses a transformer encoder-decoder architecture to create textual captions and a deep learning convolutional neural network to extract features from the images. To create the captions, we present a novel ensemble learning framework that improves the richness of the generated captions by utilizing several deep neural network architectures based on a voting mechanism that chooses the caption with the highest bilingual evaluation understudy (BLEU) score. The proposed model was evaluated on publicly available datasets. On the Flickr8K dataset, the proposed model achieved the highest BLEU-[1-3] scores, with rates of 0.728, 0.495, and 0.323, respectively. On the Flickr30k dataset, the proposed model outperformed the latest methods, as determined by BLEU-[1-4] scores of 0.798, 0.561, 0.387, and 0.269, respectively. Model efficacy was also assessed with the Semantic Propositional Image Caption Evaluation (SPICE) metric, which yielded scores of 0.164 on Flickr8K and 0.387 on Flickr30k. Finally, ensemble learning significantly advances the process of image captioning and can therefore be leveraged in various applications across different domains.
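The voting step can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: it assumes one candidate caption per ensemble member and a set of reference captions (available in the evaluation setting), and it selects the candidate with the highest sentence-level BLEU score via NLTK.

```python
# Minimal sketch of BLEU-based caption voting (illustrative, not the authors' code).
# Assumes one candidate caption per ensemble member plus reference captions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def vote_by_bleu(candidates, references):
    """Return the candidate caption with the highest sentence-level BLEU score.

    candidates: list of caption strings, one per ensemble member.
    references: list of reference caption strings for the same image.
    """
    refs = [r.split() for r in references]
    smooth = SmoothingFunction().method1  # avoid zero scores on short captions
    scored = [
        (sentence_bleu(refs, c.split(), smoothing_function=smooth), c)
        for c in candidates
    ]
    return max(scored)[1]

candidates = [
    "a dog runs across the grass",
    "a brown dog is running through a field",
]
references = ["a brown dog running through a grassy field"]
print(vote_by_bleu(candidates, references))
```

In practice each candidate would come from a different encoder-decoder variant in the ensemble; the voting logic itself stays the same.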
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
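The external memory described above can be pictured as a nearest-neighbour lookup over visual features. Below is a minimal sketch under stated assumptions: precomputed image feature vectors paired with captions, with 512-dimensional features and cosine similarity as illustrative choices rather than details from the paper.

```python
import numpy as np

# Toy external memory: rows are image feature vectors, each paired with a caption.
memory_features = np.random.randn(1000, 512).astype(np.float32)
memory_captions = [f"caption {i}" for i in range(1000)]

def retrieve_captions(query_feature, k=5):
    """Return the captions of the k visually most similar memory entries
    (cosine similarity over L2-normalised features)."""
    q = query_feature / np.linalg.norm(query_feature)
    m = memory_features / np.linalg.norm(memory_features, axis=1, keepdims=True)
    sims = m @ q
    top_k = np.argsort(-sims)[:k]
    return [memory_captions[i] for i in top_k]

# Retrieved captions would then condition the decoder as extra context.
query = np.random.randn(512).astype(np.float32)
print(retrieve_captions(query, k=3))
```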
- Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification [1.7265013728931]
This paper introduces a novel framework for zero-shot learning (ZSL) to recognize new categories that are unseen during training.
We propose three strategies to enhance the model's performance in handling ZSL.
arXiv Detail & Related papers (2024-05-03T15:02:41Z)
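Confidence-based weighting of several methods, as in the paper above, can be sketched roughly as follows; using each method's maximum class probability as its confidence is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def integrate_by_confidence(prob_list):
    """Combine per-method class probabilities, weighting each method
    by its own confidence (here: its maximum class probability)."""
    probs = np.stack(prob_list)                    # (n_methods, n_classes)
    conf = probs.max(axis=1)                       # one confidence per method
    weights = conf / conf.sum()                    # normalise to sum to 1
    return (weights[:, None] * probs).sum(axis=0)  # weighted average

method_a = np.array([0.7, 0.2, 0.1])    # a confident predictor
method_b = np.array([0.4, 0.35, 0.25])  # a less confident predictor
print(integrate_by_confidence([method_a, method_b]).argmax())
```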
- Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method on ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z)
- Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion [50.59261592343479]
We present Kandinsky, a novel exploration of latent diffusion architecture.
The image prior model is trained separately to map text embeddings to image embeddings of CLIP.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z)
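The separately trained prior above amounts to a learned mapping from text embeddings to CLIP image embeddings. A minimal PyTorch sketch of such a mapping follows; the MLP architecture and MSE loss are illustrative stand-ins, since Kandinsky's actual prior is diffusion-based.

```python
import torch
import torch.nn as nn

# Illustrative prior: maps CLIP text embeddings to CLIP image embeddings.
# An MLP trained with MSE is used here only to make the mapping idea concrete.
prior = nn.Sequential(
    nn.Linear(768, 2048), nn.GELU(),
    nn.Linear(2048, 768),
)
optimizer = torch.optim.Adam(prior.parameters(), lr=1e-4)

text_emb = torch.randn(32, 768)   # stand-in CLIP text embeddings
image_emb = torch.randn(32, 768)  # stand-in paired CLIP image embeddings

optimizer.zero_grad()
pred = prior(text_emb)
loss = nn.functional.mse_loss(pred, image_emb)
loss.backward()
optimizer.step()
```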
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
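Casting text-to-image generation as sequence-to-sequence, as Parti does, requires the image side to become a token sequence first. The sketch below shows only the interface of such a tokenizer; the codebook size, grid shape, and random ids are illustrative stand-ins for ViT-VQGAN's learned quantiser.

```python
import torch

# Illustrative stand-in for a ViT-VQGAN-style image tokenizer: an image is
# quantised into a grid of codebook indices, flattened into a token sequence.
codebook_size = 8192
grid = 16  # a 16x16 latent grid -> 256 image tokens

def encode_image_to_tokens(image):
    """Map an image tensor (3, H, W) to a sequence of discrete token ids.
    Random ids here stand in for the quantiser's nearest-codebook lookup."""
    return torch.randint(0, codebook_size, (grid * grid,))

# A seq2seq model can then be trained to predict these image tokens
# autoregressively from text tokens, much like machine translation.
image = torch.randn(3, 256, 256)
tokens = encode_image_to_tokens(image)
print(tokens.shape)  # torch.Size([256])
```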
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of its powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- End-to-End Transformer Based Model for Image Captioning [1.4303104706989949]
The Transformer-based model integrates image captioning into one stage and realizes end-to-end training.
The model achieves new state-of-the-art performance of 138.2% (single model) and 141.0% (ensemble of 4 models).
arXiv Detail & Related papers (2022-03-29T08:47:46Z)
- Image Search with Text Feedback by Additive Attention Compositional Learning [1.4395184780210915]
We propose an image-text composition module based on additive attention that can be seamlessly plugged into deep neural networks.
AACL is evaluated on three large-scale datasets (FashionIQ, Fashion200k, and Shopping100k).
arXiv Detail & Related papers (2022-03-08T02:03:49Z)
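An additive-attention composition module in this spirit can be written in a few lines; the sketch below is an illustrative simplification, not AACL's published architecture.

```python
import torch
import torch.nn as nn

class AdditiveComposition(nn.Module):
    """Fuse image region features with a text feature via additive attention.
    An illustrative sketch, not AACL's published architecture."""

    def __init__(self, dim):
        super().__init__()
        self.w_img = nn.Linear(dim, dim)
        self.w_txt = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, image_feats, text_feat):
        # image_feats: (batch, regions, dim); text_feat: (batch, dim)
        # Additive attention: score each region against the text feature.
        s = self.score(torch.tanh(
            self.w_img(image_feats) + self.w_txt(text_feat).unsqueeze(1)
        ))                                      # (batch, regions, 1)
        attn = torch.softmax(s, dim=1)
        return (attn * image_feats).sum(dim=1)  # composed feature (batch, dim)

module = AdditiveComposition(dim=512)
out = module(torch.randn(2, 49, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 512])
```

Because the module only consumes and emits plain feature tensors, it can be plugged between any image backbone and downstream retrieval head, which is the pluggability the abstract highlights.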
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
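Predicting concept tokens through a classification task, as the CTN above does, can be pictured as a multi-label head over a fixed concept vocabulary; the sketch below is an assumption-laden simplification, not the CTN's actual design.

```python
import torch
import torch.nn as nn

# Illustrative concept head: from grid features, predict which concepts
# (a fixed vocabulary of, say, 1000 words) appear in the image. The top
# scoring concepts become extra "concept tokens" for the caption decoder.
concept_vocab = 1000

class ConceptHead(nn.Module):
    def __init__(self, dim, vocab):
        super().__init__()
        self.classifier = nn.Linear(dim, vocab)

    def forward(self, grid_feats):
        # grid_feats: (batch, patches, dim); pool, then multi-label logits.
        pooled = grid_feats.mean(dim=1)
        return self.classifier(pooled)

head = ConceptHead(dim=768, vocab=concept_vocab)
logits = head(torch.randn(2, 196, 768))
top_concepts = logits.topk(k=5, dim=-1).indices  # ids of predicted concepts
print(top_concepts.shape)  # torch.Size([2, 5])
```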
- Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model [0.8793721044482612]
This paper presents an end-to-end image captioning system utilizing a multimodal architecture.
Our approach's language encoder captures the fine-grained information in the caption and, combined with the image features, generates accurate and diversified captions.
arXiv Detail & Related papers (2021-02-14T16:44:17Z)