TCE at Qur'an QA 2022: Arabic Language Question Answering Over Holy
Qur'an Using a Post-Processed Ensemble of BERT-based Models
- URL: http://arxiv.org/abs/2206.01550v1
- Date: Fri, 3 Jun 2022 13:00:48 GMT
- Authors: Mohammed ElKomy, Amany M. Sarhan
- Abstract summary: Arabic is the language of the Holy Qur'an, the sacred text for 1.8 billion people across the world.
We propose an ensemble learning model based on Arabic variants of BERT models.
Our system achieves a Partial Reciprocal Rank (pRR) score of 56.6% on the official test set.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, we witnessed great progress in different tasks of natural
language understanding using machine learning. Question answering is one of
these tasks which is used by search engines and social media platforms for
improved user experience. Arabic is the language of the Holy Qur'an, the sacred
text for 1.8 billion people across the world. Arabic is a challenging language
for Natural Language Processing (NLP) due to its complex morphological and
syntactic structures. In this article, we describe our attempts at the OSACT5
Qur'an QA 2022 Shared Task, which
is a question answering challenge on the Holy Qur'an in Arabic. We propose an
ensemble learning model based on Arabic variants of BERT models. In addition,
we perform post-processing to enhance the model predictions. Our system
achieves a Partial Reciprocal Rank (pRR) score of 56.6% on the official test
set.
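The partial Reciprocal Rank metric rewards a system whose highest-ranked prediction overlaps most with a gold answer, discounting matches that appear lower in the ranked list. A minimal sketch of how such a metric can be computed, assuming token-level F1 as the partial-match function (the exact matching function used by the shared task may differ):

```python
def token_f1(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    gold_counts = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def prr(ranked_predictions, gold_answers):
    """Partial Reciprocal Rank: the best partial-match score over the
    ranked predictions, each discounted by its 1-based rank."""
    best = 0.0
    for rank, pred in enumerate(ranked_predictions, start=1):
        score = max(token_f1(pred, g) for g in gold_answers)
        best = max(best, score / rank)
    return best
```

For example, a perfect answer at rank 1 yields 1.0, while the same answer at rank 2 yields only 0.5, so a system is pushed to order its candidate spans well, not just include the right span somewhere in the list.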
Related papers
- Cross-Language Approach for Quranic QA [1.0124625066746595]
The Quranic QA system holds significant importance as it facilitates a deeper understanding of the Quran, a Holy text for over a billion people worldwide.
These systems face unique challenges, including the linguistic disparity between questions written in Modern Standard Arabic and answers found in Quranic verses written in Classical Arabic.
We adopt a cross-language approach by expanding and enriching the dataset through machine translation to convert Arabic questions into English, paraphrasing questions to create linguistic diversity, and retrieving answers from an English translation of the Quran to align with multilingual training requirements.
arXiv Detail & Related papers (2025-01-29T07:13:27Z)
- Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world.
One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.
Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
- Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z)
- From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task.
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic.
We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- ArabicaQA: A Comprehensive Dataset for Arabic Question Answering [13.65056111661002]
We introduce ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic.
We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus.
arXiv Detail & Related papers (2024-03-26T16:37:54Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- ORCA: A Challenging Benchmark for Arabic Language Understanding [8.9379057739817]
ORCA is a publicly available benchmark for Arabic language understanding evaluation.
To measure current progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between 18 multilingual and Arabic language models.
arXiv Detail & Related papers (2022-12-21T04:35:43Z)
- Harnessing Multilingual Resources to Question Answering in Arabic [0.7233897166339269]
The goal of the paper is to predict answers to questions given a passage from the Qur'an.
The answers are always found in the passage, so the task of the model is to predict where an answer starts and where it ends.
We make use of multilingual BERT so that we can augment the training data by using data available for languages other than Arabic.
arXiv Detail & Related papers (2022-05-16T23:28:01Z)
- DTW at Qur'an QA 2022: Utilising Transfer Learning with Transformers for Question Answering in a Low-resource Domain [10.172732008860539]
Machine reading comprehension remains understudied in several domains, including religious texts.
The goal of the Qur'an QA 2022 shared task is to fill this gap by producing state-of-the-art question answering and reading comprehension research on the Qur'an.
arXiv Detail & Related papers (2022-05-12T11:17:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.