KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain
- URL: http://arxiv.org/abs/2401.08100v1
- Date: Tue, 16 Jan 2024 04:01:49 GMT
- Title: KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain
- Authors: Anh-Cuong Pham, Van-Quang Nguyen, Thi-Hong Vuong, Quang-Thuy Ha
- Abstract summary: KTVIC is a comprehensive Vietnamese Image Captioning dataset, covering a wide range of daily activities.
This dataset comprises 4,327 images and 21,635 Vietnamese captions, serving as a valuable resource for advancing image captioning in the Vietnamese language.
We conduct experiments using various deep neural networks as baselines on our dataset, evaluating them with standard image captioning metrics, including BLEU, METEOR, CIDEr, and ROUGE.
- Score: 3.495640663645263
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Image captioning is a crucial task with applications in a wide range of
domains, including healthcare and education. Despite extensive research on
English image captioning datasets, the availability of such datasets for
Vietnamese remains limited, with only two existing datasets. In this study, we
introduce KTVIC, a comprehensive Vietnamese Image Captioning dataset focused on
the life domain, covering a wide range of daily activities. This dataset
comprises 4,327 images and 21,635 Vietnamese captions, serving as a valuable
resource for advancing image captioning in the Vietnamese language. We conduct
experiments using various deep neural networks as baselines on our dataset,
evaluating them with standard image captioning metrics, including BLEU,
METEOR, CIDEr, and ROUGE. Our findings underscore the effectiveness of the
proposed dataset and its potential contributions to the field of image
captioning in the Vietnamese context.
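The metrics listed above are largely n-gram based. As a hedged illustration, here is a minimal sentence-level BLEU sketch (uniform weights, no smoothing); real evaluations typically rely on toolkits such as pycocoevalcap or sacrebleu, which add smoothing and corpus-level statistics:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Count all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precision
    (n = 1..max_n, uniform weights) times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        if not cand_counts:
            return 0.0
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # unsmoothed: any zero precision collapses the score
        precisions.append(clipped / sum(cand_counts.values()))
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else exp(1 - ref_len / max(len(cand), 1))
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

An exact match scores 1.0, while any candidate missing all 4-grams of every reference scores 0.0 under this unsmoothed variant, which is why smoothing matters for short captions.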
Related papers
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this multilingual dataset outperforms pre-training on English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z) - The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap for low-resource languages, especially Swahili.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z) - ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images [1.2529442734851663]
We introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs.
In this dataset, all the images contain text and questions about the information relevant to the text in the images.
We deploy ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset.
arXiv Detail & Related papers (2024-04-29T03:17:47Z) - Orientation-Independent Chinese Text Recognition in Scene Images [61.34060587461462]
We make the first attempt to extract orientation-independent visual features by disentangling the content and orientation information of text images.
Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover corresponding printed character images with disentangled content and orientation information.
arXiv Detail & Related papers (2023-09-03T05:30:21Z) - Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z) - UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese [2.9649783577150837]
We introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC).
The introduced dataset includes complex scenes captured in Vietnam and manually annotated by Vietnamese annotators under strict rules and supervision.
We show that our dataset is challenging for recent state-of-the-art (SOTA) Transformer-based baselines, which performed well on the MS COCO dataset.
arXiv Detail & Related papers (2023-05-07T02:48:47Z) - BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset [0.5893124686141781]
Resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets.
We present a new dataset, BAN-Cap, following the widely used Flickr8k dataset, where we collect Bangla captions for the images, written by qualified annotators.
We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning.
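The augmentation idea can be illustrated with a toy sketch. In Contextualized Word Replacement, a contextual model proposes in-context substitutes for words in a caption; in this hedged sketch, a tiny hand-made synonym table (purely hypothetical, not from the paper) stands in for that model so the augmentation loop itself is runnable:

```python
import random

# Hypothetical stand-in for a contextual replacement model: maps a word
# to plausible substitutes. A real CWR setup would query a masked
# language model for context-appropriate candidates instead.
SYNONYMS = {
    "man": ["person", "boy"],
    "bicycle": ["bike", "cycle"],
    "street": ["road"],
}

def augment_caption(caption, p=0.3, rng=None):
    """Return a copy of the caption where each replaceable word is
    swapped for a candidate substitute with probability p."""
    rng = rng or random.Random(0)  # seeded for reproducible augmentation
    out = []
    for tok in caption.split():
        if tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return " ".join(out)
```

Each augmented caption keeps the original token count and sentence structure, which is what lets the caption-image pairing survive the augmentation.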
arXiv Detail & Related papers (2022-05-28T15:39:09Z) - #PraCegoVer: A Large Dataset for Image Captioning in Portuguese [6.890235464357029]
#PraCegoVer is the first large dataset for image captioning in Portuguese with freely annotated images.
A movement called PraCegoVer arose on the Internet, encouraging social media users to publish images, tag them #PraCegoVer, and add a short description of their content.
arXiv Detail & Related papers (2021-03-21T19:55:46Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - PhraseCut: Language-based Image Segmentation in the Wild [62.643450401286]
We consider the problem of segmenting image regions given a natural language phrase.
Our dataset is collected on top of the Visual Genome dataset.
Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to the existing state-of-the-art.
arXiv Detail & Related papers (2020-08-03T20:58:53Z) - UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning [2.7528170226206443]
This paper contributes to research on the image captioning task by extending datasets to a different language: Vietnamese.
In this scope, we first build a dataset containing manually written captions for images from the Microsoft COCO dataset relating to sports played with balls.
Following that, we evaluate our dataset with deep neural network models and compare it against the English dataset and two Vietnamese datasets.
arXiv Detail & Related papers (2020-02-01T09:26:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.