AfriQA: Cross-lingual Open-Retrieval Question Answering for African
Languages
- URL: http://arxiv.org/abs/2305.06897v1
- Date: Thu, 11 May 2023 15:34:53 GMT
- Title: AfriQA: Cross-lingual Open-Retrieval Question Answering for African
Languages
- Authors: Odunayo Ogundepo, Tajuddeen R. Gwadabe, Clara E. Rivera, Jonathan H.
Clark, Sebastian Ruder, David Ifeoluwa Adelani, Bonaventure F. P. Dossou,
Abdou Aziz DIOP, Claytone Sikasote, Gilles Hacheme, Happy Buzaaba, Ignatius
Ezeani, Rooweither Mabuya, Salomey Osei, Chris Emezue, Albert Njoroge Kahira,
Shamsuddeen H. Muhammad, Akintunde Oladipo, Abraham Toluwase Owodunni, Atnafu
Lambebo Tonja, Iyanuoluwa Shode, Akari Asai, Tunde Oluwaseyi Ajayi, Clemencia
Siro, Steven Arthur, Mofetoluwa Adeyemi, Orevaoghene Ahia, Aremu Anuoluwapo,
Oyinkansola Awosan, Chiamaka Chukwuneke, Bernard Opoku, Awokoya Ayodele,
Verrah Otiende, Christine Mwase, Boyd Sinkala, Andre Niyongabo Rubungo,
Daniel A. Ajisafe, Emeka Felix Onwuegbuzia, Habib Mbow, Emile Niyomutabazi,
Eunice Mukonde, Falalu Ibrahim Lawan, Ibrahim Said Ahmad, Jesujoba O. Alabi,
Martin Namukombo, Mbonu Chinedu, Mofya Phiri, Neo Putini, Ndumiso Mngoma,
Priscilla A. Amuok, Ruqayya Nasir Iro, Sonia Adhiambo
- Abstract summary: Cross-lingual open-retrieval question answering (XOR QA) systems retrieve answer content from other languages while serving people in their native language.
We create AfriQA, the first cross-lingual QA dataset with a focus on African languages.
AfriQA includes 12,000+ XOR QA examples across 10 African languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: African languages have far less in-language content available digitally,
making it challenging for question answering systems to satisfy the information
needs of users. Cross-lingual open-retrieval question answering (XOR QA)
systems -- those that retrieve answer content from other languages while
serving people in their native language -- offer a means of filling this gap.
To this end, we create AfriQA, the first cross-lingual QA dataset with a focus
on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African
languages. While previous datasets have focused primarily on languages where
cross-lingual QA augments coverage from the target language, AfriQA focuses on
languages where cross-lingual answer content is the only high-coverage source
of answer content. Because of this, we argue that African languages are one of
the most important and realistic use cases for XOR QA. Our experiments
demonstrate the poor performance of automatic translation and multilingual
retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA
models. We hope that the dataset enables the development of more equitable QA
technology.
Related papers
- INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [26.13077589552484]
Indic-QA is the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families.
We generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage, which are then manually verified for quality assurance.
We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages.
arXiv Detail & Related papers (2024-07-18T13:57:16Z)
- CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a collection of 1.5K culturally specific questions spanning 23 languages and 51 culturally translated questions from English into 22 other languages.
We collect naturally-occurring questions from community web forums and hire native speakers to write questions to cover under-studied languages such as Fijian and Kirundi.
Our dataset contains diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- Bridging the Language Gap: Knowledge Injected Multilingual Question Answering [19.768708263635176]
We propose a generalized cross-lingual transfer framework to enhance the model's ability to understand different languages.
Experiment results on real-world datasets MLQA demonstrate that the proposed method can improve the performance by a large margin.
arXiv Detail & Related papers (2023-04-06T15:41:25Z)
- Cross-Lingual QA as a Stepping Stone for Monolingual Open QA in Icelandic [0.0]
It can be challenging to build effective open question answering (open QA) systems for languages other than English.
We present a data efficient method to bootstrap such a system for languages other than English.
Our approach requires only limited QA resources in the given language, along with machine-translated data and a language model that is at least bilingual.
arXiv Detail & Related papers (2022-07-05T09:52:34Z)
- QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers [68.9964449363406]
We extend one of the most popular KGQA benchmarks, QALD-9, by introducing high-quality translations of its questions into 8 languages.
Five of the languages (Armenian, Ukrainian, Lithuanian, Bashkir, and Belarusian) had, to the best of our knowledge, never before been considered in the KGQA research community.
arXiv Detail & Related papers (2022-01-31T22:19:55Z)
- Cross-Lingual GenQA: A Language-Agnostic Generative Question Answering Approach for Open-Domain Question Answering [76.99585451345702]
Open-Retrieval Generative Question Answering (GenQA) has been shown to deliver high-quality, natural-sounding answers in English.
We present the first generalization of the GenQA approach for the multilingual environment.
arXiv Detail & Related papers (2021-10-14T04:36:29Z)
- XOR QA: Cross-lingual Open-Retrieval Question Answering [75.20578121267411]
This work extends open-retrieval question answering to a cross-lingual setting.
We construct a large-scale dataset built on questions lacking same-language answers.
arXiv Detail & Related papers (2020-10-22T16:47:17Z)