AmQA: Amharic Question Answering Dataset
- URL: http://arxiv.org/abs/2303.03290v2
- Date: Thu, 16 Nov 2023 12:47:29 GMT
- Title: AmQA: Amharic Question Answering Dataset
- Authors: Tilahun Abedissa, Ricardo Usbeck, Yaregal Assabie
- Abstract summary: Question Answering (QA) returns concise answers or answer lists from natural language text given a context document.
There is no published or publicly available Amharic QA dataset.
We crowdsourced 2628 question-answer pairs over 378 Wikipedia articles.
- Score: 8.509075718695492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Question Answering (QA) returns concise answers or answer lists from natural
language text given a context document. Many resources go into curating QA
datasets to advance robust models' development. There is a surge of QA datasets
for languages like English; however, this is not true for Amharic. Amharic, the
official language of Ethiopia, is the second most spoken Semitic language in
the world. There is no published or publicly available Amharic QA dataset.
Hence, to foster the research in Amharic QA, we present the first Amharic QA
(AmQA) dataset. We crowdsourced 2628 question-answer pairs over 378 Wikipedia
articles. Additionally, we run an XLMR Large-based baseline model to spark
open-domain QA research interest. The best-performing baseline achieves an
F-score of 69.58 and 71.74 in reader-retriever QA and reading comprehension
settings respectively.
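The F-scores reported above are span-level QA metrics. As a rough illustration only, the sketch below computes exact match and token-overlap F1 in the SQuAD style. This page does not say which evaluation script the authors used, so the whitespace tokenization, the absence of any Amharic-specific normalization, and the function names `exact_match` and `token_f1` are all assumptions rather than the paper's method.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if both strings have the same whitespace-split tokens, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Assumption: tokens are obtained by plain whitespace splitting; a real
    evaluation script may normalize punctuation or script variants first.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a three-token prediction that fully contains a two-token reference has precision 2/3 and recall 1, giving an F1 of about 0.8. A corpus-level score is usually the mean of such per-question values, expressed as a percentage.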
Related papers
- Multilingual Non-Factoid Question Answering with Silver Answers [36.31301773167754]
This work presents MuNfQuAD, a multilingual QuAD with non-factoid questions.
It utilizes interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers.
The dataset comprises over 370K QA pairs across 38 languages, encompassing several low-resource languages.
arXiv Detail & Related papers (2024-08-20T07:37:06Z)
- CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a collection of 1.5K culturally specific questions spanning 23 languages and 51 culturally translated questions from English into 22 other languages.
We collect naturally-occurring questions from community web forums and hire native speakers to write questions to cover under-studied languages such as Fijian and Kirundi.
Our dataset contains diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z)
- UQA: Corpus for Urdu Question Answering [3.979019316355144]
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu.
UQA is generated by translating the Stanford Question Answering dataset (SQuAD2.0), a large-scale English QA dataset.
The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T.
arXiv Detail & Related papers (2024-05-02T16:44:31Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- Building Efficient and Effective OpenQA Systems for Low-Resource Languages [17.64851283209797]
We show that effective, low-cost OpenQA systems can be developed for low-resource contexts.
Key ingredients are weak supervision using machine-translated labeled datasets and a relevant unstructured knowledge source.
We present SQuAD-TR, a machine translation of SQuAD2.0, and we build our OpenQA system by adapting ColBERT-QA and retraining it over Turkish resources.
arXiv Detail & Related papers (2024-01-07T22:11:36Z)
- Fully Authentic Visual Question Answering Dataset from Online Communities [72.0524198499719]
Visual Question Answering (VQA) entails answering questions about images.
We introduce the first VQA dataset in which all contents originate from an authentic use case.
We characterize this dataset and how it relates to eight mainstream VQA datasets.
arXiv Detail & Related papers (2023-11-27T06:19:00Z)
- IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions [54.23087908182134]
We introduce the first large-scale counterfactual open-domain question-answering (QA) benchmarks, named IfQA.
The IfQA dataset contains over 3,800 questions that were annotated by crowdworkers on relevant Wikipedia passages.
The unique challenges posed by the IfQA benchmark will push open-domain QA research on both retrieval and counterfactual reasoning fronts.
arXiv Detail & Related papers (2023-05-23T12:43:19Z)
- JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension [0.0]
We present the Japanese Question Answering dataset, JaQuAD, which is annotated by humans.
JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.
We fine-tuned a baseline model that achieves an F1 score of 78.92% and an exact match (EM) of 63.38% on the test set.
arXiv Detail & Related papers (2022-02-03T18:40:25Z)
- QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers [68.9964449363406]
We extend one of the most popular KGQA benchmarks - QALD-9 by introducing high-quality questions' translations to 8 languages.
Five of the languages (Armenian, Ukrainian, Lithuanian, Bashkir and Belarusian) had, to the best of our knowledge, never been considered in the KGQA research community before.
arXiv Detail & Related papers (2022-01-31T22:19:55Z)
- PerCQA: Persian Community Question Answering Dataset [2.503043323723241]
Community Question Answering (CQA) forums provide answers for many real-life questions.
We present PerCQA, the first Persian dataset for CQA.
This dataset contains the questions and answers crawled from the most well-known Persian forum.
arXiv Detail & Related papers (2021-12-25T14:06:41Z)
- Cross-Lingual GenQA: A Language-Agnostic Generative Question Answering Approach for Open-Domain Question Answering [76.99585451345702]
Open-Retrieval Generative Question Answering (GenQA) is proven to deliver high-quality, natural-sounding answers in English.
We present the first generalization of the GenQA approach for the multilingual environment.
arXiv Detail & Related papers (2021-10-14T04:36:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.