KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource
Language
- URL: http://arxiv.org/abs/2205.02364v3
- Date: Sun, 9 Jul 2023 14:06:02 GMT
- Title: KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource
Language
- Authors: Barack W. Wanjawa (1), Lilian D.A. Wanzare (2), Florence Indede (2),
Owen McOnyango (2), Lawrence Muchemi (1), Edward Ombui (3) ((1) University of
Nairobi Kenya, (2) Maseno University Kenya (3) Africa Nazarene University
Kenya)
- Abstract summary: This dataset is annotated from raw story texts in Swahili, a low-resource language.
QA datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems.
The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The need for Question Answering datasets in low resource languages is the
motivation of this research, leading to the development of Kencorpus Swahili
Question Answering Dataset, KenSwQuAD. This dataset is annotated from raw story
texts in Swahili, a low-resource language that is spoken predominantly in
Eastern Africa and also in other parts of the world. Question Answering (QA)
datasets are important for machine comprehension of natural language for tasks
such as internet search and dialog systems. Machine learning systems need
training data such as the gold standard Question Answering set developed in
this research. The research engaged annotators to formulate QA pairs from
Swahili texts collected by the Kencorpus project, a Kenyan languages corpus.
The project annotated 1,445 of the total 2,585 texts with at least 5 QA
pairs each, resulting in a final dataset of 7,526 QA pairs. A quality
assurance set of 12.5% of the annotated texts confirmed that the QA pairs were
all correctly annotated. A proof of concept applying the set to the QA task
confirmed that the dataset is usable for such tasks. KenSwQuAD has also
contributed to resourcing of the Swahili language.
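The figures quoted in the abstract can be related to each other with a few lines of arithmetic. The sketch below is only an illustration derived from the numbers stated above; the variable names are hypothetical, and the size of the quality assurance set in texts is an estimate computed from the stated 12.5%, not a count reported by the authors.

```python
# Figures taken from the abstract above.
total_texts = 2585          # texts collected by the Kencorpus project
annotated_texts = 1445      # texts annotated with QA pairs
qa_pairs = 7526             # QA pairs in the final dataset
qa_sample_fraction = 0.125  # quality assurance set: 12.5% of annotated texts

# Derived values (illustrative, not reported by the authors).
coverage = annotated_texts / total_texts
pairs_per_text = qa_pairs / annotated_texts
qa_sample_texts = annotated_texts * qa_sample_fraction

print(f"Share of texts annotated: {coverage:.1%}")
print(f"Mean QA pairs per annotated text: {pairs_per_text:.2f}")
print(f"Estimated quality assurance set: about {qa_sample_texts:.0f} texts")
```

The mean of slightly more than 5 pairs per annotated text is consistent with the stated minimum of 5 QA pairs per text.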
Related papers
- SwaQuAD-24: QA Benchmark Dataset in Swahili [0.0]
This paper proposes the creation of a Swahili Question Answering (QA) benchmark dataset.
The dataset will focus on providing high-quality, annotated question-answer pairs that capture the linguistic diversity and complexity of Swahili.
Ethical considerations, such as data privacy, bias mitigation, and inclusivity, are central to the dataset development.
arXiv Detail & Related papers (2024-10-18T08:49:24Z)
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering [58.92057773071854]
We introduce MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages.
arXiv Detail & Related papers (2024-05-20T12:35:01Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA)
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering [0.4194295877935868]
This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-resource languages.
We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples.
arXiv Detail & Related papers (2024-04-20T12:16:35Z)
- EuSQuAD: Automatically Translated and Aligned SQuAD2.0 for Basque [0.4499833362998487]
This work presents EuSQuAD, the first initiative dedicated to automatically translating and aligning SQuAD2.0 into Basque.
We demonstrate EuSQuAD's value through extensive qualitative analysis and QA experiments supported with EuSQuAD as training data.
arXiv Detail & Related papers (2024-04-18T13:31:57Z)
- HaVQA: A Dataset for Visual Question Answering and Multimodal Research
in Hausa Language [1.3476084087665703]
HaVQA is the first multimodal dataset for visual question-answering tasks in the Hausa language.
The dataset was created by manually translating 6,022 English question-answer pairs, which are associated with 1,555 unique images from the Visual Genome dataset.
arXiv Detail & Related papers (2023-05-28T10:55:31Z)
- Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering.
We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system.
We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Cross-Lingual Question Answering over Knowledge Base as Reading
Comprehension [61.079852289005025]
Cross-lingual question answering over knowledge base (xKBQA) aims to answer questions in languages different from that of the provided knowledge base.
One of the major challenges facing xKBQA is the high cost of data annotation.
We propose a novel approach for xKBQA in a reading comprehension paradigm.
arXiv Detail & Related papers (2023-02-26T05:52:52Z)
- QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia
and Wikidata Translated by Native Speakers [68.9964449363406]
We extend one of the most popular KGQA benchmarks - QALD-9 by introducing high-quality questions' translations to 8 languages.
Five of the languages - Armenian, Ukrainian, Lithuanian, Bashkir and Belarusian - to our best knowledge were never considered in KGQA research community before.
arXiv Detail & Related papers (2022-01-31T22:19:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.