UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese
- URL: http://arxiv.org/abs/2305.04166v2
- Date: Tue, 9 May 2023 12:46:06 GMT
- Title: UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese
- Authors: Doanh C. Bui, Nghia Hieu Nguyen, Khang Nguyen
- Abstract summary: We introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC).
The introduced dataset includes complex scenes captured in Vietnam, manually annotated by Vietnamese annotators under strict rules and supervision.
We show that our dataset is challenging for recent state-of-the-art (SOTA) Transformer-based baselines that performed well on the MS-COCO dataset.
- Score: 2.9649783577150837
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Image Captioning is one of the vision-language tasks that still interest the research community worldwide in the 2020s. The MS-COCO Caption benchmark is commonly used to evaluate the performance of advanced captioning models, although it was published in 2015. Recent captioning models trained on the MS-COCO Caption dataset perform well only on English language patterns; they neither handle contexts captured in Vietnam well nor caption images fluently in Vietnamese. To contribute to the low-resource research community in Vietnam, we introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC). The introduced dataset includes complex scenes captured in Vietnam, manually annotated by Vietnamese annotators under strict rules and supervision. In this paper, we present the dataset creation process in detail. From preliminary analysis, we show that our dataset is challenging for recent state-of-the-art (SOTA) Transformer-based baselines that perform well on the MS-COCO dataset. These modest results suggest that UIT-OpenViIC has room to grow and can serve as a standard Vietnamese benchmark for the research community to evaluate captioning models. Furthermore, we present our CAMO approach, which enhances image representation through a multi-level encoder output fusion mechanism, improving the quality of generated captions compared to previous captioning models.
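The abstract does not spell out CAMO's architecture, so the snippet below is only a hedged illustration of what a "multi-level encoder output fusion mechanism" could look like: every Transformer encoder layer's output is kept and combined with learned softmax weights before being handed to a caption decoder. All class names, dimensions, and the weighted-sum fusion rule are assumptions made for illustration, not the authors' implementation.

```python
# Hedged sketch: one plausible reading of a multi-level encoder output
# fusion mechanism, NOT the paper's actual CAMO implementation.
import torch
import torch.nn as nn


class MultiLevelFusionEncoder(nn.Module):
    """Stack of Transformer encoder layers whose per-layer outputs are
    fused into a single visual memory for a caption decoder."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        # One learnable scalar per layer; softmax turns them into fusion weights.
        self.level_logits = nn.Parameter(torch.zeros(n_layers))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_regions, d_model) image region/grid features
        outputs, x = [], region_feats
        for layer in self.layers:
            x = layer(x)
            outputs.append(x)                 # keep every level's output
        weights = torch.softmax(self.level_logits, dim=0)
        fused = sum(w * o for w, o in zip(weights, outputs))
        return self.norm(fused)               # fused memory fed to the decoder


if __name__ == "__main__":
    encoder = MultiLevelFusionEncoder()
    feats = torch.randn(2, 36, 512)            # e.g. 36 detected regions per image
    print(encoder(feats).shape)                # torch.Size([2, 36, 512])
```

A weighted sum is only one possible fusion rule; concatenation followed by a projection, or gated residual mixing, would fit the same description equally well.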
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z) - ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images [1.2529442734851663]
We introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs.
In this dataset, all the images contain text and questions about the information relevant to the text in the images.
We deploy ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset.
arXiv Detail & Related papers (2024-04-29T03:17:47Z) - KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain [3.495640663645263]
KTVIC is a comprehensive Vietnamese Image Captioning dataset, covering a wide range of daily activities.
This dataset comprises 4,327 images and 21,635 Vietnamese captions, serving as a valuable resource for advancing image captioning in the Vietnamese language.
We conduct experiments using various deep neural networks as baselines on our dataset, evaluating them with the standard image captioning metrics, including BLEU, METEOR, CIDEr, and ROUGE (a minimal scoring sketch is given after this list).
arXiv Detail & Related papers (2024-01-16T04:01:49Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z) - VieSum: How Robust Are Transformer-based Models on Vietnamese
Summarization? [1.1379578593538398]
We investigate the robustness of transformer-based encoder-decoder architectures for Vietnamese abstractive summarization.
We validate the performance of the methods on two Vietnamese datasets.
arXiv Detail & Related papers (2021-10-08T17:10:31Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - A Vietnamese Dataset for Evaluating Machine Reading Comprehension [2.7528170226206443]
We present UIT-ViQuAD, a new dataset for Vietnamese, a low-resource language, to evaluate machine reading comprehension models.
This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.
We conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD.
arXiv Detail & Related papers (2020-09-30T15:06:56Z) - UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image
Captioning [2.7528170226206443]
This paper contributes to research on the image captioning task by extending the dataset to a different language, Vietnamese.
In this scope, we first build a dataset containing manually written captions for images from the Microsoft COCO dataset relating to sports played with balls.
Following that, we evaluate our dataset on deep neural network models and compare it with an English dataset and two Vietnamese datasets.
arXiv Detail & Related papers (2020-02-01T09:26:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.