QASiNa: Religious Domain Question Answering using Sirah Nabawiyah
- URL: http://arxiv.org/abs/2310.08102v1
- Date: Thu, 12 Oct 2023 07:52:19 GMT
- Title: QASiNa: Religious Domain Question Answering using Sirah Nabawiyah
- Authors: Muhammad Razif Rizqullah (1), Ayu Purwarianti (1) and Alham Fikri Aji
(2) ((1) Bandung Institute of Technology, (2) Mohamed bin Zayed University of
Artificial Intelligence)
- Abstract summary: In Islam, the sources of information, and who may give interpretations (tafseer) of those sources, are strictly regulated.
The approach used by an LLM to generate answers based on its own interpretation is similar to the concept of tafseer.
We propose the Question Answering Sirah Nabawiyah (QASiNa) dataset, a novel dataset compiled from Sirah Nabawiyah literature in the Indonesian language.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Nowadays, Question Answering (QA) tasks receive significant research focus,
particularly with the development of Large Language Models (LLMs) such as
ChatGPT [1]. LLMs can be applied to various domains, but applying them to the
Islamic domain conflicts with its principles of information transmission. In
Islam, the sources of information, and who may give interpretations or tafseer
of those sources, are strictly regulated [2]. The approach used by an LLM to
generate answers based on its own interpretation is similar to the concept of
tafseer, yet an LLM is neither an Islamic expert nor a human, which is not
permitted in Islam. Indonesia is the country with the largest Muslim population
in the world [3]. Given the growing influence of LLMs, we need to evaluate them
in the religious domain. Currently, only a few religious QA datasets are
available, and none of them uses Sirah Nabawiyah, especially in the Indonesian
language. In this paper, we propose the Question Answering Sirah Nabawiyah
(QASiNa) dataset, a novel dataset compiled from Sirah Nabawiyah literature in
the Indonesian language. We demonstrate our dataset using mBERT [4], XLM-R [5],
and IndoBERT [6], each fine-tuned on an Indonesian translation of SQuAD v2.0
[7]. The XLM-R model returned the best performance on QASiNa, with an EM of
61.20, an F1-score of 75.94, and a Substring Match of 70.00. We compare XLM-R's
performance with ChatGPT-3.5 and GPT-4 [1]. Both ChatGPT versions returned
lower EM and F1-scores with higher Substring Match, and the gap between EM and
Substring Match widens in GPT-4. The experiments indicate that ChatGPT tends to
give excessive interpretations, as evidenced by its higher Substring Match
scores compared to EM and F1-score, even after providing instructions and
context. We conclude that ChatGPT is unsuitable for the question answering task
in the religious domain, especially for Islam.
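The abstract's diagnosis hinges on how the three metrics differ: a verbose answer that contains the gold span scores zero on Exact Match, partially on token-level F1, but fully on Substring Match. The following is an illustrative sketch of these metrics; the normalization follows the standard SQuAD convention, and the exact substring-match definition used by the paper is an assumption here (credit when either normalized answer contains the other).

```python
# Sketch of the three QA metrics discussed above: Exact Match (EM),
# token-level F1, and Substring Match. Normalization is SQuAD-style;
# the substring-match rule is an assumed, illustrative definition.
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 only when the normalized answers are identical."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def substring_match(pred: str, gold: str) -> float:
    """1.0 when either normalized answer contains the other."""
    p, g = normalize(pred), normalize(gold)
    return float(g in p or p in g)


# A verbose prediction: it fails EM, scores partially on F1, yet gets full
# Substring Match credit -- the pattern the abstract attributes to ChatGPT.
pred = "The Prophet migrated to Madinah in the year 622 CE."
gold = "622 CE"
print(exact_match(pred, gold))      # 0.0
print(round(f1_score(pred, gold), 2))
print(substring_match(pred, gold))  # 1.0
```

A model that over-explains thus inflates Substring Match relative to EM and F1, which is exactly the widening gap the abstract reports for GPT-4.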
Related papers
- From RAG to Agentic RAG for Faithful Islamic Question Answering [12.67590523116037]
We introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers.
We also develop an agentic Quran-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision.
arXiv Detail & Related papers (2026-01-12T13:28:28Z) - Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content [1.922162958936778]
Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses.
We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs.
GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82).
arXiv Detail & Related papers (2025-10-28T14:05:55Z) - Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions [10.53116395328794]
We introduce a novel benchmark, FiqhQA, focused on LLM-generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English.
Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought.
To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained, school-of-thought-specific Islamic ruling generation and to evaluate abstention for Islamic queries.
arXiv Detail & Related papers (2025-08-04T07:27:26Z) - Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models [0.18846515534317265]
General-purpose large language models (LLMs) often struggle with hallucinations, where generated responses deviate from authoritative sources.
This challenge highlights the need for systems that can integrate domain-specific knowledge while maintaining response accuracy, relevance, and faithfulness.
This research utilizes a descriptive dataset of Quranic surahs including the meanings, historical context, and qualities of the 114 surahs.
The models are evaluated using three key metrics set by human evaluators: context relevance, answer faithfulness, and answer relevance.
arXiv Detail & Related papers (2025-03-20T13:26:30Z) - Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Models (LLMs) reasoning tasks.
We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs.
These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z) - One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [55.35278531907263]
We present the first study on Large Language Models' fairness and robustness to dialects in canonical reasoning tasks.
We hire AAVE speakers to rewrite seven popular benchmarks, such as HumanEval and GSM8K.
We find that, compared to Standardized English, almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z) - Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios [29.56889133557681]
This research evaluates the performance of seven leading Large Language Models (LLMs) in sentiment analysis on a dataset derived from WhatsApp chats.
We find that while Mistral-7b and Mixtral-8x7b achieved high F1 scores, they and other LLMs such as GPT-3.5-Turbo, Llama-2-70b, and Gemma-7b struggled with understanding linguistic and contextual nuances.
GPT-4 and GPT-4-Turbo excelled in grasping diverse linguistic inputs and managing various contextual information.
arXiv Detail & Related papers (2024-06-01T07:36:59Z) - Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom [4.142301960178498]
SwordsmanImp is the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature.
It includes 200 carefully handcrafted questions, all annotated on which Gricean maxims have been violated.
Our results show that GPT-4 attains human-level accuracy (94%) on multiple-choice questions.
Other models, including GPT-3.5 and several open-source models, demonstrate a lower accuracy ranging from 20% to 60% on multiple-choice questions.
arXiv Detail & Related papers (2024-04-30T12:43:53Z) - What Evidence Do Language Models Find Convincing? [94.90663008214918]
We build a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts.
We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions.
Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important.
arXiv Detail & Related papers (2024-02-19T02:15:34Z) - A RAG-based Question Answering System Proposal for Understanding Islam:
MufassirQAS LLM [0.34530027457862006]
This study uses a vector database-based Retrieval Augmented Generation (RAG) approach to enhance the accuracy and transparency of LLMs.
We created a database consisting of several open-access books that include Turkish context.
MufassirQAS and ChatGPT are also tested with sensitive questions.
arXiv Detail & Related papers (2024-01-27T10:50:11Z) - The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR) which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z) - Mispronunciation Detection of Basic Quranic Recitation Rules using Deep
Learning [0.0]
In Islam, readers must apply a set of pronunciation rules called Tajweed rules to recite the Quran.
The number of Tajweed teachers is not enough nowadays for daily recitation practice for every Muslim.
We propose a solution that consists of Mel-Frequency Cepstral Coefficient (MFCC) features with Long Short-Term Memory (LSTM) neural networks which use the time series.
arXiv Detail & Related papers (2023-05-10T19:31:25Z) - Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT).
This paper systematically investigates the advantages and challenges of LLMs for MMT.
We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.