Bench-Marking And Improving Arabic Automatic Image Captioning Through
The Use Of Multi-Task Learning Paradigm
- URL: http://arxiv.org/abs/2202.05474v1
- Date: Fri, 11 Feb 2022 06:29:25 GMT
- Title: Bench-Marking And Improving Arabic Automatic Image Captioning Through
The Use Of Multi-Task Learning Paradigm
- Authors: Muhy Eddin Za'ter, Bashar Talafha
- Abstract summary: This paper explores methods and techniques that could enhance the performance of Arabic image captioning.
The use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning.
However, the presented results show that Arabic captioning still lags behind English.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The continuous increase in the use of social media and of visual content on
the internet has accelerated research in computer vision in general and in image
captioning in particular. Generating a caption that best describes an image is useful
for various applications, such as image indexing and serving as an aural aid for the
visually impaired. In recent years, the image captioning task has witnessed remarkable
advances in both datasets and architectures, and as a result captioning quality has
reached an astounding level of performance. However, the majority of these advances,
especially in datasets, target English, which has left other languages such as Arabic
lagging behind. Arabic, despite being spoken by more than 450 million people and being
the fastest-growing language on the internet, lacks the fundamental pillars needed to
advance its image captioning research, such as benchmarks or unified datasets. This
work is an attempt to expedite progress on this task by providing unified datasets and
benchmarks, while also exploring methods and techniques that could enhance the
performance of Arabic image captioning. The use of multi-task learning is explored,
alongside various word representations and different features. The results show that
multi-task learning and pre-trained word embeddings noticeably enhance the quality of
image captioning; however, Arabic captioning still lags behind English. The used
dataset and code are available at this link.
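The abstract describes the approach only at a high level, so the following is a minimal PyTorch sketch of the multi-task idea it names: a shared image encoder feeds both a caption decoder and an auxiliary head, and the two losses are summed. The auxiliary image-tagging task, the dimensions, and every module choice here are assumptions for illustration, not the authors' architecture; the embedding layer is where pre-trained Arabic word vectors (e.g. fastText) would be loaded.

```python
# Minimal multi-task captioning sketch (illustrative only, not the paper's exact model).
# Assumptions: precomputed image features, an LSTM decoder, and a hypothetical
# auxiliary image-tagging head; nn.Embedding is where pre-trained word vectors
# would be loaded, as the abstract suggests.
import torch
import torch.nn as nn

class MultiTaskCaptioner(nn.Module):
    def __init__(self, vocab_size, num_tags, feat_dim=2048, emb_dim=300, hid_dim=512):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hid_dim)       # shared image encoder
        self.embed = nn.Embedding(vocab_size, emb_dim)    # load pre-trained vectors here
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.word_head = nn.Linear(hid_dim, vocab_size)   # captioning head
        self.tag_head = nn.Linear(hid_dim, num_tags)      # auxiliary multi-label head

    def forward(self, feats, captions):
        img = torch.tanh(self.encoder(feats))             # (B, hid_dim)
        h0 = img.unsqueeze(0)                             # init decoder state with image
        c0 = torch.zeros_like(h0)
        out, _ = self.decoder(self.embed(captions), (h0, c0))
        return self.word_head(out), self.tag_head(img)

def joint_loss(word_logits, tag_logits, targets, tags, aux_weight=0.3):
    # Caption cross-entropy on shifted targets plus a weighted auxiliary tagging loss.
    cap = nn.functional.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)), targets.reshape(-1))
    aux = nn.functional.binary_cross_entropy_with_logits(tag_logits, tags)
    return cap + aux_weight * aux

# Toy usage with random data.
model = MultiTaskCaptioner(vocab_size=1000, num_tags=80)
feats = torch.randn(4, 2048)
caps = torch.randint(0, 1000, (4, 12))
tags = torch.randint(0, 2, (4, 80)).float()
word_logits, tag_logits = model(feats, caps[:, :-1])
loss = joint_loss(word_logits, tag_logits, caps[:, 1:], tags)
loss.backward()
```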
Related papers
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
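The summary above names the components without detail; the following is a hedged PyTorch sketch of the general retrieval idea only, not the authors' implementation: an external corpus is indexed by image features, the k nearest neighbours of a query image are fetched by visual similarity, and their caption encodings are appended to the memory that an attention layer attends over. All shapes and the cosine-similarity index are assumptions.

```python
# Minimal sketch of a kNN memory for retrieval-augmented captioning (illustrative only).
import torch
import torch.nn.functional as F

class KNNMemory:
    """Stores image features of an external corpus with encodings of their captions."""
    def __init__(self, corpus_feats, corpus_caption_enc):
        self.keys = F.normalize(corpus_feats, dim=-1)      # (N, D) for cosine retrieval
        self.values = corpus_caption_enc                   # (N, L, D) encoded captions

    def retrieve(self, query_feats, k=3):
        sims = F.normalize(query_feats, dim=-1) @ self.keys.T   # (B, N) visual similarity
        idx = sims.topk(k, dim=-1).indices                      # (B, k)
        return self.values[idx]                                 # (B, k, L, D)

def knn_augmented_attention(decoder_states, image_tokens, retrieved, attn):
    # Concatenate encoder tokens with flattened retrieved caption encodings so the
    # decoder can attend over both when predicting the next token.
    B, k, L, D = retrieved.shape
    memory = torch.cat([image_tokens, retrieved.reshape(B, k * L, D)], dim=1)
    out, _ = attn(decoder_states, memory, memory)
    return out

# Toy usage with random data.
D = 256
mem = KNNMemory(torch.randn(500, D), torch.randn(500, 10, D))
retrieved = mem.retrieve(torch.randn(2, D), k=3)                # (2, 3, 10, D)
attn = torch.nn.MultiheadAttention(D, num_heads=4, batch_first=True)
out = knn_augmented_attention(torch.randn(2, 8, D), torch.randn(2, 49, D), retrieved, attn)
print(out.shape)  # torch.Size([2, 8, 256])
```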
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset [0.5893124686141781]
Resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets.
We present a new dataset, BAN-Cap, built on the widely used Flickr8k dataset, for which qualified annotators provide Bangla captions of the images.
We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning.
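As a rough illustration of contextualized word replacement, and not BAN-Cap's exact procedure, the sketch below masks one word of a caption and lets a multilingual masked language model propose an in-context substitute; the choice of bert-base-multilingual-cased and the single-word replacement rate are assumptions.

```python
# Rough sketch of Contextualized Word Replacement (CWR)-style augmentation.
# Assumptions: a generic multilingual masked LM stands in for whatever model was
# actually used, and exactly one randomly chosen word per caption is replaced.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def contextualized_word_replacement(caption: str) -> str:
    words = caption.split()
    if len(words) < 2:
        return caption
    i = random.randrange(len(words))
    masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
    best = fill_mask(masked, top_k=1)[0]       # highest-scoring in-context substitute
    words[i] = best["token_str"].strip()
    return " ".join(words)

print(contextualized_word_replacement("a man rides a bicycle down the street"))
```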
arXiv Detail & Related papers (2022-05-28T15:39:09Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not reached a conclusive answer yet.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- UNISON: Unpaired Cross-lingual Image Captioning [17.60054750276632]
We present a novel unpaired cross-lingual method to generate image captions without relying on any caption corpus in the source or the target language.
Specifically, our method consists of two phases: (i) a cross-lingual auto-encoding process, which utilizes a sentence-parallel (bitext) corpus to learn the mapping from the source to the target language in the scene graph encoding space and decodes sentences in the target language, and (ii) a cross-modal unsupervised feature mapping, which seeks to map the encoded scene graph features from the image modality to the language modality.
arXiv Detail & Related papers (2020-10-03T06:14:06Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
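The joint word/tag prediction described here can be pictured as a decoder step with two output heads trained with a summed loss; the sketch below illustrates only that idea under an assumed hidden size and tag set, not the paper's caption-guided graph model.

```python
# Minimal sketch of joint word / tag prediction at one decoding step (illustrative only).
import torch
import torch.nn as nn

class TwoHeadDecoderStep(nn.Module):
    """One decoder step with two heads: the next word and its object/predicate tag."""
    def __init__(self, vocab_size, tag_size, dim=512):
        super().__init__()
        self.rnn = nn.GRUCell(dim, dim)
        self.word_head = nn.Linear(dim, vocab_size)
        self.tag_head = nn.Linear(dim, tag_size)

    def forward(self, x, h):
        h = self.rnn(x, h)
        return self.word_head(h), self.tag_head(h), h

# Toy usage: tag set assumed to be {object, predicate, other}.
step = TwoHeadDecoderStep(vocab_size=1000, tag_size=3)
x, h = torch.randn(4, 512), torch.zeros(4, 512)
word_logits, tag_logits, h = step(x, h)
word_target, tag_target = torch.randint(0, 1000, (4,)), torch.randint(0, 3, (4,))
loss = nn.functional.cross_entropy(word_logits, word_target) \
     + nn.functional.cross_entropy(tag_logits, tag_target)
loss.backward()
```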
arXiv Detail & Related papers (2020-06-21T14:10:47Z)