JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension
- URL: http://arxiv.org/abs/2202.01764v1
- Date: Thu, 3 Feb 2022 18:40:25 GMT
- Title: JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension
- Authors: ByungHoon So, Kyuhong Byun, Kyungwon Kang, Seongjin Cho
- Abstract summary: We present the Japanese Question Answering dataset, JaQuAD, which is annotated by humans.
JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.
We fine-tuned a baseline model that achieves an F1 score of 78.92% and an exact match (EM) of 63.38% on the test set.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Question Answering (QA) is a task in which a machine understands a given
document and a question to find an answer. Despite impressive progress in the
NLP area, QA is still a challenging problem, especially for non-English
languages due to the lack of annotated datasets. In this paper, we present the
Japanese Question Answering Dataset, JaQuAD, which is annotated by humans.
JaQuAD consists of 39,696 extractive question-answer pairs on Japanese
Wikipedia articles. We fine-tuned a baseline model that achieves an F1 score of
78.92% and an EM of 63.38% on the test set. The dataset and our experiments are
available at https://github.com/SkelterLabsInc/JaQuAD.
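The EM and F1 figures above are the standard extractive-QA metrics: EM checks whether a predicted span exactly matches the gold answer after normalization, and F1 measures token overlap between the two. A minimal SQuAD-style sketch is below; note this is an illustrative word-level version, whereas Japanese evaluation (as in JaQuAD) is typically done at the character level because Japanese text has no whitespace word boundaries.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace (SQuAD-style)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def exact_match(prediction, gold):
    """EM: 1 if the normalized prediction equals the normalized gold answer."""
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between a predicted span and the gold answer span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "the cat sat" against the gold answer "the cat" gives precision 2/3 and recall 1, hence F1 = 0.8 but EM = 0.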
Related papers
- Multilingual Non-Factoid Question Answering with Silver Answers [36.31301773167754]
This work presents MuNfQuAD, a multilingual QuAD with non-factoid questions.
It utilizes interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers.
The dataset comprises over 370K QA pairs across 38 languages, encompassing several low-resource languages.
arXiv Detail & Related papers (2024-08-20T07:37:06Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- KazQAD: Kazakh Open-Domain Question Answering Dataset [2.8158674707210136]
KazQAD is a Kazakh open-domain question answering dataset.
It can be used in reading comprehension and full ODQA settings.
It contains just under 6,000 unique questions with extracted short answers.
arXiv Detail & Related papers (2024-04-06T03:40:36Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- AmQA: Amharic Question Answering Dataset [8.509075718695492]
Question Answering (QA) returns concise answers or answer lists from natural language text given a context document.
There is no published or publicly available Amharic QA dataset.
We crowdsourced 2628 question-answer pairs over 378 Wikipedia articles.
arXiv Detail & Related papers (2023-03-06T17:06:50Z)
- KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language [0.0]
This dataset is annotated from raw story texts in Swahili, a low-resource language.
QA datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems.
The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project.
arXiv Detail & Related papers (2022-05-04T23:53:23Z)
- PQuAD: A Persian Question Answering Dataset [0.0]
PQuAD is a crowdsourced reading comprehension dataset on Persian Wikipedia articles.
It includes 80,000 questions along with their answers, with 25% of the questions being adversarially unanswerable.
Our experiments with different state-of-the-art pre-trained contextualized language models show 74.8% Exact Match (EM) and an 87.6% F1-score.
arXiv Detail & Related papers (2022-02-13T05:42:55Z)
- QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers [68.9964449363406]
We extend one of the most popular KGQA benchmarks, QALD-9, by introducing high-quality translations of its questions into 8 languages.
Five of the languages - Armenian, Ukrainian, Lithuanian, Bashkir, and Belarusian - have, to the best of our knowledge, never been considered in the KGQA research community before.
arXiv Detail & Related papers (2022-01-31T22:19:55Z)
- A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers [66.11048565324468]
We present a dataset of 5,049 questions over 1,585 Natural Language Processing papers.
Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers.
arXiv Detail & Related papers (2021-05-07T00:12:34Z)
- IIRC: A Dataset of Incomplete Information Reading Comprehension Questions [53.3193258414806]
We present a dataset, IIRC, with more than 13K questions over paragraphs from English Wikipedia.
The questions were written by crowd workers who did not have access to any of the linked documents.
We follow recent modeling work on various reading comprehension datasets to construct a baseline model for this dataset.
arXiv Detail & Related papers (2020-11-13T20:59:21Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.