Optimized Quran Passage Retrieval Using an Expanded QA Dataset and Fine-Tuned Language Models
- URL: http://arxiv.org/abs/2412.11431v1
- Date: Mon, 16 Dec 2024 04:03:58 GMT
- Title: Optimized Quran Passage Retrieval Using an Expanded QA Dataset and Fine-Tuned Language Models
- Authors: Mohamed Basem, Islam Oshallah, Baraa Hikal, Ali Hamdi, Ammar Mohamed
- Abstract summary: The Qur'an QA 2023 shared task dataset had a limited number of questions with weak model retrieval.
The original dataset, which contains 251 questions, was reviewed and expanded to 629 questions with question diversification and reformulation.
Experiments fine-tuned transformer models, including AraBERT, RoBERTa, CAMeLBERT, AraELECTRA, and BERT.
- Abstract: Understanding the deep meanings of the Qur'an and bridging the language gap between Modern Standard Arabic and Classical Arabic are essential to improving question-answering systems for the Holy Qur'an. The Qur'an QA 2023 shared task dataset contained a limited number of questions and yielded weak model retrieval. To address this challenge, this work updated the original dataset and improved model accuracy. The original dataset of 251 questions was reviewed and expanded to 629 questions through question diversification and reformulation, leading to a comprehensive set of 1,895 questions categorized into single-answer, multi-answer, and zero-answer types. Extensive experiments fine-tuned transformer models, including AraBERT, RoBERTa, CAMeLBERT, AraELECTRA, and BERT. The best model, AraBERT-base, achieved a MAP@10 of 0.36 and an MRR of 0.59, improvements of 63% and 59%, respectively, over the baseline scores (MAP@10: 0.22, MRR: 0.37). Additionally, the dataset expansion improved the handling of "no answer" cases: the proposed approach achieved a 75% success rate on such instances, compared to the baseline's 25%. These results demonstrate the effect of dataset improvement and model architecture optimization on the performance of QA systems for the Holy Qur'an, yielding higher accuracy, recall, and precision.
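The MAP@10 and MRR figures quoted above follow standard ranking-metric definitions. As a minimal sketch (assuming binary relevance judgments per ranked passage, and normalizing average precision by the number of relevant passages found in the top k, one common convention), they can be computed as:

```python
def mrr(ranked_relevance):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for flags in ranked_relevance:  # one 0/1 list per query, in rank order
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_relevance)


def map_at_k(ranked_relevance, k=10):
    """Mean Average Precision at cutoff k over binary relevance lists."""
    total = 0.0
    for flags in ranked_relevance:
        top = flags[:k]
        hits = 0
        ap = 0.0
        for rank, rel in enumerate(top, start=1):
            if rel:
                hits += 1
                ap += hits / rank  # precision at each relevant rank
        if hits:
            total += ap / hits  # normalize by relevant results retrieved in top k
    return total / len(ranked_relevance)
```

For example, `mrr([[0, 1, 0], [1, 0, 0]])` averages 1/2 and 1/1 to 0.75; whether the paper normalizes average precision by retrieved or by total relevant passages is not stated here, so this choice is an assumption.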
Related papers
- AmaSQuAD: A Benchmark for Amharic Extractive Question Answering [0.0]
This research presents a novel framework for translating extractive question-answering datasets into low-resource languages.
The methodology addresses challenges related to misalignment between translated questions and answers.
We fine-tune the XLM-R model on the AmaSQuAD synthetic dataset for Amharic Question-Answering.
arXiv Detail & Related papers (2025-02-04T06:27:39Z)
- Cross-Language Approach for Quranic QA [1.0124625066746595]
The Quranic QA system holds significant importance as it facilitates a deeper understanding of the Quran, a Holy text for over a billion people worldwide.
These systems face unique challenges, including the linguistic disparity between questions written in Modern Standard Arabic and answers found in Quranic verses written in Classical Arabic.
We adopt a cross-language approach by expanding and enriching the dataset through machine translation to convert Arabic questions into English, paraphrasing questions to create linguistic diversity, and retrieving answers from an English translation of the Quran to align with multilingual training requirements.
arXiv Detail & Related papers (2025-01-29T07:13:27Z)
- Building a Rich Dataset to Empower the Persian Question Answering Systems [0.6138671548064356]
This dataset is called NextQuAD and has 7,515 contexts, including 23,918 questions and answers.
BERT-based question answering model has been applied to this dataset using two pre-trained language models.
Evaluation on the development set shows 0.95 Exact Match (EM) and 0.97 F1 score.
arXiv Detail & Related papers (2024-12-28T16:53:25Z)
- KET-QA: A Dataset for Knowledge Enhanced Table Question Answering [63.56707527868466]
We propose to use a knowledge base (KB) as the external knowledge source for TableQA.
Every question requires the integration of information from both the table and the sub-graph to be answered.
We design a retriever-reasoner structured pipeline model to extract pertinent information from the vast knowledge sub-graph.
arXiv Detail & Related papers (2024-05-13T18:26:32Z)
- Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian [0.0]
We create the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr.
To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset.
Best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score.
arXiv Detail & Related papers (2024-04-12T17:27:54Z)
- TCE at Qur'an QA 2023 Shared Task: Low Resource Enhanced Transformer-based Ensemble Approach for Qur'anic QA [0.0]
We present our approach to tackle Qur'an QA 2023 shared tasks A and B.
To address the challenge of low-resourced training data, we rely on transfer learning together with a voting ensemble.
We employ different architectures and learning mechanisms for a range of Arabic pre-trained transformer-based models for both tasks.
arXiv Detail & Related papers (2024-01-23T19:32:54Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Improving Passage Retrieval with Zero-Shot Question Generation [109.11542468380331]
We propose a simple and effective re-ranking method for improving passage retrieval in open question answering.
The re-ranker re-scores retrieved passages with a zero-shot question generation model, which uses a pre-trained language model to compute the probability of the input question conditioned on a retrieved passage.
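The query-likelihood idea behind this re-ranker can be sketched independently of any particular model. Here `token_log_prob` is a hypothetical stand-in for the pre-trained language model's scoring call, used only for illustration, not the paper's actual API:

```python
def rerank_by_question_likelihood(question_tokens, passages, token_log_prob):
    """Re-order passages by the log-likelihood of the question given each passage.

    token_log_prob(token, history, passage) returns the log-probability of the
    next question token given the passage and the question tokens so far; in
    the paper this comes from a pre-trained language model, but any callable
    with this (hypothetical) interface works for the sketch.
    """
    def score(passage):
        logp = 0.0
        history = []
        for tok in question_tokens:
            logp += token_log_prob(tok, tuple(history), passage)
            history.append(tok)
        return logp

    # Higher log P(question | passage) ranks the passage earlier.
    return sorted(passages, key=score, reverse=True)
```

Because the question is fixed and only the conditioning passage varies, no labeled training data is needed, which is what makes the method zero-shot.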
arXiv Detail & Related papers (2022-04-15T14:51:41Z)
- TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance [71.76018597965378]
We build a new large-scale Question Answering dataset containing both Tabular And Textual data, named TAT-QA.
We propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text.
arXiv Detail & Related papers (2021-05-17T06:12:06Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
- Logic-Guided Data Augmentation and Regularization for Consistent Question Answering [55.05667583529711]
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions.
Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
arXiv Detail & Related papers (2020-04-21T17:03:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.