TCE at Qur'an QA 2022: Arabic Language Question Answering Over Holy
Qur'an Using a Post-Processed Ensemble of BERT-based Models
- URL: http://arxiv.org/abs/2206.01550v1
- Date: Fri, 3 Jun 2022 13:00:48 GMT
- Title: TCE at Qur'an QA 2022: Arabic Language Question Answering Over Holy
Qur'an Using a Post-Processed Ensemble of BERT-based Models
- Authors: Mohammed ElKomy, Amany M. Sarhan
- Abstract summary: Arabic is the language of the Holy Qur'an, the sacred text for 1.8 billion people across the world.
We propose an ensemble learning model based on Arabic variants of BERT models.
Our system achieves a Partial Reciprocal Rank (pRR) score of 56.6% on the official test set.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, we have witnessed great progress in various natural
language understanding tasks using machine learning. Question answering is one
of these tasks; it is used by search engines and social media platforms to
improve the user experience. Arabic is the language of the Holy Qur'an, the
sacred text for 1.8 billion people across the world. Arabic is a challenging
language for Natural Language Processing (NLP) due to its complex structures. In this
article, we describe our attempts at the OSACT5 Qur'an QA 2022 Shared Task, which
is a question answering challenge on the Holy Qur'an in Arabic. We propose an
ensemble learning model based on Arabic variants of BERT models. In addition,
we perform post-processing to enhance the model predictions. Our system
achieves a Partial Reciprocal Rank (pRR) score of 56.6% on the official test
set.
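As context for the headline number, below is a minimal sketch of a pRR-style scorer. It assumes pRR is the maximum, over the ranked candidate answers, of the answer's partial-match score divided by its rank, with token-level F1 standing in for the partial-match function; the official shared-task scorer may differ in details, so treat this as illustrative rather than the paper's evaluation code.

```python
# Illustrative pRR (partial Reciprocal Rank) scorer -- a sketch, not the
# official Qur'an QA 2022 evaluation script.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 overlap, standing in for the partial-match function."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def prr(ranked_answers: list[str], gold_answers: list[str]) -> float:
    """Maximum over ranks of (best partial match at that rank) / rank."""
    return max(
        max(token_f1(answer, gold) for gold in gold_answers) / rank
        for rank, answer in enumerate(ranked_answers, start=1)
    )

# The second-ranked answer overlaps the gold answer, so pRR is roughly F1 / 2.
print(prr(["an irrelevant span", "God created the heavens"],
          ["in the beginning God created the heavens"]))
```

An ensemble in the spirit of the paper would produce the ranked_answers list by aggregating spans predicted by several Arabic BERT variants (e.g., by voting or score averaging) and post-processing them before ranking.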
Related papers
- Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embedding matrix (a sketch follows this entry).
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z)
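A minimal sketch of the embedding-only first stage described above, using Hugging Face transformers; the checkpoint name and the added tokens are placeholders, not the authors' setup.

```python
# Sketch: expand a tokenizer's vocabulary, then train only the embeddings.
# Checkpoint and added tokens are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the monolingual base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 1a: extend the vocabulary with target-language tokens
# (a real list would come from a tokenizer trained on Arabic text).
tokenizer.add_tokens(["السلام", "عليكم"])
model.resize_token_embeddings(len(tokenizer))

# Stage 1b: freeze everything except the embedding matrix.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True  # GPT-2 ties this to the LM head

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only the embedding rows
```

Stage two (continual pre-training on mixed Arabic and English corpora) would then unfreeze the full model.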
- Quranic Audio Dataset: Crowdsourced and Labeled Recitation from Non-Arabic Speakers [1.2124551005857038]
This paper addresses the challenge of learning to recite the Quran for non-Arabic speakers.
We use a volunteer-based crowdsourcing approach and implement a crowdsourcing API to gather audio assets.
We have collected around 7000 Quranic recitations from a pool of 1287 participants across more than 11 non-Arabic countries.
arXiv Detail & Related papers (2024-05-04T14:29:05Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- ArabicaQA: A Comprehensive Dataset for Arabic Question Answering [13.65056111661002]
We introduce ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic.
We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus (a DPR-style scoring sketch follows this entry).
arXiv Detail & Related papers (2024-03-26T16:37:54Z)
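As an illustration of the dense-retrieval idea behind AraDPR, here is a sketch of DPR-style scoring: encode the question and the passages, then rank passages by dot product. DPR proper trains separate question and passage encoders and uses the [CLS] vector; this sketch uses one stock encoder with mean pooling for brevity, and the checkpoint is a placeholder, not AraDPR.

```python
# Sketch of DPR-style dense retrieval; checkpoint is a placeholder, not AraDPR.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"  # stand-in encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

question = embed(["Who created the heavens?"])
passages = embed(["In the beginning God created the heavens.",
                  "The cat sat on the mat."])
scores = question @ passages.T          # dot-product relevance, shape (1, 2)
print(scores.argsort(descending=True))  # passage indices, best first
```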
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models [57.76998376458017]
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs).
The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts.
We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models.
arXiv Detail & Related papers (2023-08-30T17:07:17Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration (a toy sketch follows this entry).
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
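To make the FST idea concrete without guessing at the paper's library API, here is a toy hand-rolled transducer: each transition consumes an input character and emits an output string. A single-state machine like this is just a character map; real script-manipulation libraries compose much richer multi-state machines, and the mapping below is illustrative, not a standard romanization scheme.

```python
# Toy single-state finite-state transducer for romanization.
# The mapping is illustrative, not a complete or standard scheme.
TRANSITIONS = {  # (state, input_char) -> (next_state, output)
    (0, "س"): (0, "s"),
    (0, "ل"): (0, "l"),
    (0, "ا"): (0, "a"),
    (0, "م"): (0, "m"),
}

def transduce(word: str, start: int = 0) -> str:
    state, out = start, []
    for ch in word:
        state, emitted = TRANSITIONS[(state, ch)]  # KeyError = no valid path
        out.append(emitted)
    return "".join(out)

print(transduce("سلام"))  # -> "slam"
```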
- ORCA: A Challenging Benchmark for Arabic Language Understanding [8.9379057739817]
ORCA is a publicly available benchmark for Arabic language understanding evaluation.
To measure current progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between 18 multilingual and Arabic language models.
arXiv Detail & Related papers (2022-12-21T04:35:43Z)
- Harnessing Multilingual Resources to Question Answering in Arabic [0.7233897166339269]
The goal of the paper is to predict answers to questions given a passage from the Qur'an.
The answers are always found in the passage, so the task of the model is to predict where an answer starts and where it ends (a sketch follows this entry).
We make use of multilingual BERT so that we can augment the training data with data available for languages other than Arabic.
arXiv Detail & Related papers (2022-05-16T23:28:01Z)
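The start/end formulation described above is the standard extractive-QA setup. A minimal sketch with a BERT-style QA head follows; the checkpoint is a placeholder and would need fine-tuning on QA data before its predictions mean anything.

```python
# Sketch of extractive QA: predict the start and end of the answer span.
# Checkpoint is a placeholder; fine-tuning on QA data is assumed.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"  # stand-in multilingual model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "Who created the heavens?"
passage = "In the beginning God created the heavens."
inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most probable start position, then the best end at or after it.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits[0, start:].argmax()) + start
answer_ids = inputs["input_ids"][0, start:end + 1]
print(tokenizer.decode(answer_ids))
```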
- DTW at Qur'an QA 2022: Utilising Transfer Learning with Transformers for Question Answering in a Low-resource Domain [10.172732008860539]
Machine reading comprehension remains understudied in several domains, including religious texts.
The goal of the Qur'an QA 2022 shared task is to fill this gap by producing state-of-the-art question answering and reading comprehension research on the Qur'an.
arXiv Detail & Related papers (2022-05-12T11:17:23Z)