KazQAD: Kazakh Open-Domain Question Answering Dataset
- URL: http://arxiv.org/abs/2404.04487v1
- Date: Sat, 6 Apr 2024 03:40:36 GMT
- Title: KazQAD: Kazakh Open-Domain Question Answering Dataset
- Authors: Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov, Ardak Shalkarbayuli, Pavel Braslavski
- Abstract summary: KazQAD is a Kazakh open-domain question answering dataset.
It can be used in reading comprehension and full ODQA settings.
It contains just under 6,000 unique questions with extracted short answers.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389, MRR = 0.382), reading comprehension (EM = 38.5, F1 = 54.2), and full ODQA (EM = 17.8, F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we believe there is ample room for improvement. We also show that OpenAI's current ChatGPT (GPT-3.5) is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD.
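The abstract reports SQuAD-style reading-comprehension metrics (EM, F1) and ranking metrics (NDCG@10, MRR). As a rough illustration, not the authors' evaluation code, these metrics can be sketched as follows; the text normalization here is simplified (official SQuAD scoring also strips punctuation and English articles), and the relevance lists are hypothetical binary judgements:

```python
# Sketch of the metrics reported in the abstract: Exact Match (EM) and
# token-level F1 for reading comprehension; MRR and NDCG@10 for retrieval.
from collections import Counter
import math

def normalize(text: str) -> str:
    # Simplified normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    # Token-overlap F1 between predicted and gold answer spans.
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def mrr(relevances: list) -> float:
    # relevances: binary judgements for a ranked list of passages.
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_10(relevances: list) -> float:
    # Discounted cumulative gain over the top 10, normalized by the
    # ideal (sorted) ranking of the same judgements.
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(exact_match("Astana", "astana"))           # 1.0
print(round(f1("the city of Astana", "Astana"), 2))  # 0.4
print(mrr([0, 1, 0]))                            # 0.5
```

Corpus-level scores are then the mean of these per-question values over the test set.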
Related papers
- MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering
This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-resource languages.
We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples.
arXiv Detail & Related papers (2024-04-20T12:16:35Z)
- Fully Authentic Visual Question Answering Dataset from Online Communities
Visual Question Answering (VQA) entails answering questions about images.
We introduce the first VQA dataset in which all contents originate from an authentic use case.
We characterize this dataset and how it relates to eight mainstream VQA datasets.
arXiv Detail & Related papers (2023-11-27T06:19:00Z)
- IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions
We introduce the first large-scale counterfactual open-domain question-answering (QA) benchmarks, named IfQA.
The IfQA dataset contains over 3,800 questions that were annotated by crowdworkers on relevant Wikipedia passages.
The unique challenges posed by the IfQA benchmark will push open-domain QA research on both retrieval and counterfactual reasoning fronts.
arXiv Detail & Related papers (2023-05-23T12:43:19Z)
- AmQA: Amharic Question Answering Dataset
Question Answering (QA) systems return concise answers or answer lists from natural language text, given a context document.
There is no published or publicly available Amharic QA dataset.
We crowdsourced 2628 question-answer pairs over 378 Wikipedia articles.
arXiv Detail & Related papers (2023-03-06T17:06:50Z)
- KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language
This dataset is annotated from raw story texts in Swahili, a low-resource language.
QA datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems.
The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project.
arXiv Detail & Related papers (2022-05-04T23:53:23Z)
- WikiOmnia: generative QA corpus on the whole Russian Wikipedia
We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections.
The dataset includes every available article from Wikipedia for the Russian language.
The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification.
arXiv Detail & Related papers (2022-04-17T12:59:36Z)
- JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension
We present the Japanese Question Answering dataset, JaQuAD, which is annotated by humans.
JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.
We fine-tuned a baseline model that achieves an F1 score of 78.92% and an EM of 63.38% on the test set.
arXiv Detail & Related papers (2022-02-03T18:40:25Z)
- QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers
We extend one of the most popular KGQA benchmarks, QALD-9, by introducing high-quality translations of its questions into 8 languages.
Five of the languages -- Armenian, Ukrainian, Lithuanian, Bashkir, and Belarusian -- had, to the best of our knowledge, never been considered in the KGQA research community before.
arXiv Detail & Related papers (2022-01-31T22:19:55Z) - ConditionalQA: A Complex Reading Comprehension Dataset with Conditional
Answers [93.55268936974971]
We describe a Question Answering dataset that contains complex questions with conditional answers.
We call this dataset ConditionalQA.
We show that ConditionalQA is challenging for many of the existing QA models, especially in selecting answer conditions.
arXiv Detail & Related papers (2021-10-13T17:16:46Z) - QAConv: Question Answering on Informative Conversations [85.2923607672282]
We focus on informative conversations including business emails, panel discussions, and work channels.
In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions.
arXiv Detail & Related papers (2021-05-14T15:53:05Z)
- Open Question Answering over Tables and Text
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain it.
Most open QA systems have considered only retrieving information from unstructured text.
We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.