A Vietnamese Dataset for Evaluating Machine Reading Comprehension
- URL: http://arxiv.org/abs/2009.14725v3
- Date: Sat, 7 Nov 2020 06:40:46 GMT
- Title: A Vietnamese Dataset for Evaluating Machine Reading Comprehension
- Authors: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen
- Abstract summary: We present UIT-ViQuAD, a new dataset for evaluating machine reading comprehension models in Vietnamese, a low-resource language.
This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.
We conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD.
- Score: 2.7528170226206443
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Over 97 million people worldwide speak Vietnamese as their native
language. However, there are few research studies for Vietnamese on machine
reading comprehension (MRC), the task of understanding a text and answering
questions related to it. Due to the lack of benchmark datasets for Vietnamese,
we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new
dataset for evaluating MRC models in Vietnamese, a low-resource language. This dataset
comprises over 23,000 human-generated question-answer pairs based on 5,109
passages of 174 Vietnamese articles from Wikipedia. In particular, we propose a
new process of dataset creation for Vietnamese MRC. Our in-depth analyses
illustrate that our dataset requires abilities beyond simple reasoning like
word matching and demands single-sentence and multiple-sentence inferences.
Besides, we conduct experiments on state-of-the-art MRC methods for English and
Chinese as the first experimental models on UIT-ViQuAD. We also estimate human
performance on the dataset and compare it to the experimental results of
powerful machine learning models. As a result, the substantial differences
between human performance and the best model performance on the dataset
indicate that improvements can be made on UIT-ViQuAD in future research. Our
dataset is freely available on our website to encourage the research community
to overcome challenges in Vietnamese MRC.
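The abstract compares human and model performance but does not name the metrics in this summary. A minimal sketch of how such a comparison is commonly scored for span-extraction MRC, assuming exact match (EM) and token-level F1 as the metrics and whitespace tokenization (both are assumptions here, and the function names and sample data are hypothetical):

```python
from __future__ import annotations

from collections import Counter


def tokens(text: str) -> list[str]:
    # Assumption: lowercase and split on whitespace. Real evaluation scripts
    # usually also strip punctuation, which is omitted in this sketch.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 when the normalized token sequences are identical, else 0.0.
    return float(tokens(prediction) == tokens(gold))


def f1_score(prediction: str, gold: str) -> float:
    # Token-level F1 from the multiset overlap of predicted and gold tokens.
    pred, ref = tokens(prediction), tokens(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions: dict[str, str],
             references: dict[str, list[str]]) -> dict[str, float]:
    # For each question, score against every gold answer and keep the best.
    if not references:
        raise ValueError("references must not be empty")
    em_total = 0.0
    f1_total = 0.0
    for qid, golds in references.items():
        pred = predictions.get(qid, "")  # a missing prediction scores zero
        em_total += max(exact_match(pred, g) for g in golds)
        f1_total += max(f1_score(pred, g) for g in golds)
    n = len(references)
    return {"em": 100.0 * em_total / n, "f1": 100.0 * f1_total / n}


# Hypothetical sample data, not taken from UIT-ViQuAD.
sample_references = {"q1": ["Hà Nội"], "q2": ["năm 1945", "1945"]}
sample_predictions = {"q1": "Hà Nội", "q2": "vào năm 1945"}
scores = evaluate(sample_predictions, sample_references)
print(scores)
```

Human performance can be estimated the same way by treating one annotator's answer as the prediction and the remaining annotators' answers as the references.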
Related papers
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z)
- VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension [1.3942150186842373]
This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension tasks.
The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks.
In contrast, VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube.
arXiv Detail & Related papers (2024-02-05T00:54:40Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., the Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- VLSP 2021 Shared Task: Vietnamese Machine Reading Comprehension [2.348805691644086]
This article presents details of the organization of the shared task, an overview of the methods employed by shared-task participants, and the results.
We provide a benchmark dataset named UIT-ViQuAD 2.0 for evaluating the MRC task and question answering systems for the Vietnamese language.
The UIT-ViQuAD 2.0 dataset motivates more researchers to explore Vietnamese machine reading comprehension, question answering, and question generation.
arXiv Detail & Related papers (2022-03-22T00:44:41Z)
- Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval [51.004601358498135]
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages.
The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z)
- Sentence Extraction-Based Machine Reading Comprehension for Vietnamese [0.2446672595462589]
We introduce UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in the Vietnamese language.
The dataset comprises 23,074 question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.
Our experiments show that the best machine model is XLM-R$_{Large}$, which achieves an exact match (EM) score of 85.97% and an F1-score of 88.77% on our dataset.
arXiv Detail & Related papers (2021-05-19T10:22:27Z)
- COVID-19 Named Entity Recognition for Vietnamese [6.17059264011429]
We present the first manually-annotated COVID-19 domain-specific dataset for Vietnamese.
Our dataset is annotated for the named entity recognition task with newly-defined entity types.
Our dataset also contains the largest number of entities compared to existing Vietnamese NER datasets.
arXiv Detail & Related papers (2021-04-08T16:35:34Z)
- A Sentence Cloze Dataset for Chinese Machine Reading Comprehension [64.07894249743767]
We propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC).
The proposed task aims to fill each blank in a passage with the right candidate sentence.
We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task.
arXiv Detail & Related papers (2020-04-07T04:09:00Z)
- Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension [2.5199066832791535]
We construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts.
We propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text.
Our proposed method achieves an accuracy of 61.81%, which is 5.51% higher than the best baseline model.
arXiv Detail & Related papers (2020-01-16T08:09:51Z)