UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image
Captioning
- URL: http://arxiv.org/abs/2002.00175v1
- Date: Sat, 1 Feb 2020 09:26:07 GMT
- Title: UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image
Captioning
- Authors: Quan Hoang Lam, Quang Duy Le, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
- Abstract summary: This paper contributes to research on the Image Captioning task by extending the dataset landscape to a different language: Vietnamese.
In this scope, we first build a dataset containing manually written captions for images from the Microsoft COCO dataset relating to sports played with balls.
Following that, we evaluate our dataset on deep neural network models and compare it with an English dataset and two Vietnamese datasets.
- Score: 2.7528170226206443
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Image Captioning, the task of automatically generating captions for images, has
attracted attention from researchers in many fields of computer science, including
computer vision, natural language processing, and machine learning, in recent
years. This paper contributes to research on the Image Captioning task by
extending the dataset landscape to a different language: Vietnamese. Since no
Image Captioning dataset has existed for the Vietnamese language so far, this is
the fundamental first step toward developing Vietnamese Image Captioning. In this
scope, we first build a dataset, which we call UIT-ViIC, containing manually
written captions for images from the Microsoft COCO dataset relating to sports
played with balls. UIT-ViIC consists of 19,250 Vietnamese captions for 3,850
images. Following that, we evaluate our dataset on deep neural network models
and compare it with an English dataset and two Vietnamese datasets built by
different methods. UIT-ViIC is published on our lab website for research
purposes.
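Because UIT-ViIC reuses images from Microsoft COCO, one reasonable working assumption is that its captions are distributed in the standard COCO caption JSON schema; the abstract does not confirm the release format, so the file name and field layout in the sketch below are illustrative, not confirmed.

```python
import json
from collections import defaultdict

# Hypothetical file name; the actual UIT-ViIC distribution format is not
# stated in the abstract. The layout assumed here is the standard MS-COCO
# caption schema: an "annotations" list whose entries carry an image_id
# and a caption string.
with open("uit_viic_captions.json", encoding="utf-8") as f:
    data = json.load(f)

captions_per_image = defaultdict(list)
for ann in data["annotations"]:
    captions_per_image[ann["image_id"]].append(ann["caption"])

# The abstract's figures imply five captions per image:
# 19,250 captions / 3,850 images = 5, matching the MS-COCO convention.
print(len(captions_per_image))                           # expected: 3850
print(sum(len(c) for c in captions_per_image.values()))  # expected: 19250
```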
Related papers
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap in scene-text resources for low-resource languages, especially Swahili.
Swahili is widely spoken in East African countries but remains under-explored in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images [1.2529442734851663]
We introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs.
In this dataset, all the images contain text and questions about the information relevant to the text in the images.
We deploy ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset.
arXiv Detail & Related papers (2024-04-29T03:17:47Z)
- ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images [1.2529442734851663]
We introduce the first large-scale Vietnamese dataset focused on understanding text appearing in images.
We uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers.
arXiv Detail & Related papers (2024-04-16T15:28:30Z)
- An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines composed of state-of-the-art generative models to perform the task.
We conduct a human evaluation of the translated images to assess cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z)
- KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain [3.495640663645263]
KTVIC is a comprehensive Vietnamese Image Captioning dataset, covering a wide range of daily activities.
This dataset comprises 4,327 images and 21,635 Vietnamese captions, serving as a valuable resource for advancing image captioning in the Vietnamese language.
We conduct experiments using various deep neural networks as baselines on our dataset, evaluating them with standard image captioning metrics, including BLEU, METEOR, CIDEr, and ROUGE.
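For reference, scores of this kind are commonly computed with the pycocoevalcap package. The following is a minimal sketch under two assumptions not stated in the summary: captions are pre-tokenized (whitespace-separated), and METEOR is skipped because its pycocoevalcap wrapper shells out to a bundled Java jar.

```python
# Minimal caption-metric sketch using pycocoevalcap (pip install pycocoevalcap).
# Both dicts map an image id to a list of caption strings; each hypothesis
# list holds exactly one generated caption.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

refs = {
    "img1": ["một cầu thủ đang sút bóng trên sân", "cầu thủ đá quả bóng"],
    "img2": ["hai đội đang thi đấu bóng rổ", "các cầu thủ tranh bóng rổ"],
}
hyps = {
    "img1": ["một cầu thủ đang đá bóng"],
    "img2": ["các cầu thủ đang chơi bóng rổ"],
}

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)  # [BLEU-1, ..., BLEU-4]
cider_score, _ = Cider().compute_score(refs, hyps)  # corpus-level CIDEr
rouge_score, _ = Rouge().compute_score(refs, hyps)  # ROUGE-L

print("BLEU-4:", bleu_scores[3])
print("CIDEr:", cider_score)
print("ROUGE-L:", rouge_score)
```

Note that CIDEr is a corpus-level metric (its TF-IDF weights are computed over the reference set), so toy inputs like these yield scores that are only illustrative.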
arXiv Detail & Related papers (2024-01-16T04:01:49Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese [2.9649783577150837]
We introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC).
The introduced dataset includes complex scenes captured in Vietnam, manually annotated by Vietnamese annotators under strict rules and supervision.
We show that our dataset is challenging for recent state-of-the-art (SOTA) Transformer-based baselines that perform well on the MS COCO dataset.
arXiv Detail & Related papers (2023-05-07T02:48:47Z)
- Sentence Extraction-Based Machine Reading Comprehension for Vietnamese [0.2446672595462589]
We introduce UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in the Vietnamese language.
The dataset comprises 23,074 question-answer pairs based on 5,109 passages from 174 Vietnamese Wikipedia articles.
Our experiments show that the best machine model is XLM-R_Large, which achieves an exact match (EM) score of 85.97% and an F1-score of 88.77% on our dataset.
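The EM and F1 figures quoted here follow the standard SQuAD-style definitions: EM requires the prediction to match a reference exactly, while F1 is the harmonic mean of token-level precision and recall. A self-contained sketch, assuming plain whitespace tokenization and no Vietnamese-specific normalization (both simplifications):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 only when the two strings are identical after trimming whitespace.
    return float(prediction.strip() == reference.strip())

def token_f1(prediction: str, reference: str) -> float:
    # Harmonic mean of token-level precision and recall over the
    # multiset of overlapping tokens.
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy usage: a near-miss prediction scores 0 on EM but high on F1.
pred = "Hà Nội là thủ đô của Việt Nam"
ref = "Hà Nội là thủ đô Việt Nam"
print(exact_match(pred, ref))  # 0.0
print(token_f1(pred, ref))     # ~0.93
```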
arXiv Detail & Related papers (2021-05-19T10:22:27Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
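The dual-encoder objective described here is, in spirit, the symmetric in-batch contrastive (InfoNCE) loss. A minimal PyTorch sketch follows; the temperature value and L2 normalization are illustrative defaults, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders;
    row i of each tensor is a matching image-text pair, and every
    other row in the batch acts as an in-batch negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text retrieval
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image retrieval
    return (loss_i2t + loss_t2i) / 2
```

At the billion-pair scale described above, larger batches supply more in-batch negatives, which is part of why scale can compensate for noisy alt-text.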
arXiv Detail & Related papers (2021-02-11T10:08:12Z)