BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset
- URL: http://arxiv.org/abs/2205.14462v1
- Date: Sat, 28 May 2022 15:39:09 GMT
- Title: BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset
- Authors: Mohammad Faiyaz Khan, S.M. Sadiq-Ur-Rahman Shifath, Md Saiful Islam
- Abstract summary: Resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets.
We present a new dataset, BAN-Cap, following the widely used Flickr8k dataset, where qualified annotators provide Bangla captions for the images.
We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning.
- Score: 0.5893124686141781
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As computers have become efficient at understanding visual information and
transforming it into a written representation, research interest in tasks like
automatic image captioning has seen a significant leap over the last few years.
While most of the research attention is given to the English language in a
monolingual setting, resource-constrained languages like Bangla remain out of
focus, predominantly due to a lack of standard datasets. Addressing this issue,
we present a new dataset, BAN-Cap, following the widely used Flickr8k dataset,
where qualified annotators provide Bangla captions for the images. Our dataset
represents a wider variety of image caption styles
annotated by trained people from different backgrounds. We present a
quantitative and qualitative analysis of the dataset and the baseline
evaluation of the recent models in Bangla image captioning. We investigate the
effect of text augmentation and demonstrate that an adaptive attention-based
model combined with text augmentation using Contextualized Word Replacement
(CWR) outperforms all state-of-the-art models for Bangla image captioning. We
also demonstrate this dataset's multipurpose nature, especially for
Bangla-English and English-Bangla machine translation. This dataset and all the
models will be useful for further research.
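To make the two headline techniques concrete, two illustrative sketches follow. First, Contextualized Word Replacement (CWR): the common recipe is to mask one word of a caption at a time and let a masked language model propose an in-context substitute. This is a minimal sketch under that assumption; the checkpoint name and the `cwr_augment` helper with its `replace_prob` and `top_k` parameters are hypothetical illustrations, not the authors' code.

```python
# Hypothetical sketch of Contextualized Word Replacement (CWR):
# mask one word at a time and let a Bangla masked LM propose an
# in-context substitute. Not the paper's actual implementation.
import random

from transformers import pipeline

# Any Bangla masked-LM checkpoint would do; this name is an assumption.
fill_mask = pipeline("fill-mask", model="sagorsarker/bangla-bert-base")

def cwr_augment(caption: str, replace_prob: float = 0.15, top_k: int = 5) -> str:
    """Return a paraphrased caption with some words replaced in context."""
    words = caption.split()
    augmented = list(words)
    for i, word in enumerate(words):
        if random.random() > replace_prob:
            continue
        # Mask the i-th word and ask the LM for contextual replacements.
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        for cand in fill_mask(masked, top_k=top_k):
            token = cand["token_str"].strip()
            if token and token != word:  # keep the best non-identical candidate
                augmented[i] = token
                break
    return " ".join(augmented)
```

Each call produces one paraphrased variant of a training caption, so a small Bangla caption set can be expanded without new annotation.

Second, "adaptive attention" here refers to the visual-sentinel mechanism of Lu et al. (2017): at each decoding step the decoder attends over spatial image features plus a learned sentinel vector, and the sentinel's attention weight gates how much the next word relies on the image versus the language state. The PyTorch module below is a self-contained sketch of that general mechanism, assuming image features, hidden state, and sentinel share one dimension `dim`; it illustrates the idea, not the paper's exact model.

```python
# Illustrative sketch of adaptive attention with a visual sentinel.
# Assumes image features, decoder state, and sentinel all have size `dim`.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    def __init__(self, dim: int, att_dim: int = 256):
        super().__init__()
        self.feat_proj = nn.Linear(dim, att_dim)      # projects spatial features
        self.hidden_proj = nn.Linear(dim, att_dim)    # projects decoder state
        self.sentinel_proj = nn.Linear(dim, att_dim)  # projects the sentinel
        self.score = nn.Linear(att_dim, 1)            # scalar attention scores

    def forward(self, feats, hidden, sentinel):
        # feats: (B, K, dim); hidden, sentinel: (B, dim)
        h = self.hidden_proj(hidden).unsqueeze(1)                                      # (B, 1, att)
        z_img = self.score(torch.tanh(self.feat_proj(feats) + h))                      # (B, K, 1)
        z_sen = self.score(torch.tanh(self.sentinel_proj(sentinel).unsqueeze(1) + h))  # (B, 1, 1)
        # Softmax over the K image regions plus the sentinel slot.
        alpha = F.softmax(torch.cat([z_img, z_sen], dim=1), dim=1)                     # (B, K+1, 1)
        beta = alpha[:, -1]                                                            # (B, 1) sentinel gate
        visual_ctx = (alpha[:, :-1] * feats).sum(dim=1)                                # (B, dim)
        # beta near 1: rely on language context; beta near 0: rely on the image.
        return beta * sentinel + (1.0 - beta) * visual_ctx, alpha

# Smoke test with random tensors (batch of 2, 49 regions, dim 512).
att = AdaptiveAttention(dim=512)
ctx, alpha = att(torch.randn(2, 49, 512), torch.randn(2, 512), torch.randn(2, 512))
assert ctx.shape == (2, 512)
```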
Related papers
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- Satellite Captioning: Large Language Models to Augment Labeling [0.0]
Caption datasets present a much more difficult challenge due to language differences, grammar, and the time it takes for humans to generate them.
Current datasets have certainly provided many instances to work with, but they become problematic when a captioner has a limited vocabulary.
This paper aims to address this issue of potential information and communication shortcomings in caption datasets.
arXiv Detail & Related papers (2023-12-18T03:21:58Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models.
arXiv Detail & Related papers (2022-07-09T01:56:35Z)
- Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm [0.0]
This paper explores methods and techniques that could enhance the performance of Arabic image captioning.
The use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning.
However, the presented results show that Arabic captioning still lags behind English.
arXiv Detail & Related papers (2022-02-11T06:29:25Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- Fine-Grained Image Generation from Bangla Text Description using Attentional Generative Adversarial Network [0.0]
We propose Bangla Attentional Generative Adversarial Network (AttnGAN) that allows intensified, multi-stage processing for high-resolution Bangla text-to-image generation.
For the first time, a fine-grained image is generated from Bangla text using attentional GAN.
arXiv Detail & Related papers (2021-09-24T05:31:01Z)
- TextMage: The Automated Bangla Caption Generator Based On Deep Learning [1.2330326247154968]
TextMage is a system capable of understanding visual scenes in the Bangladeshi geographical context.
This dataset contains 9,154 images along with two annotations for each image.
arXiv Detail & Related papers (2020-10-15T23:24:15Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.