Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions
- URL: http://arxiv.org/abs/2508.08287v1
- Date: Mon, 04 Aug 2025 07:27:26 GMT
- Title: Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions
- Authors: Farah Atif, Nursultan Askarbekuly, Kareem Darwish, Monojit Choudhury
- Abstract summary: We introduce FiqhQA, a novel benchmark focused on LLM-generated Islamic rulings, explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained, school-of-thought-specific Islamic ruling generation and to evaluate abstention for Islamic queries.
- Score: 10.53116395328794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the increasing use of Large Language Models (LLMs) to answer questions in a variety of domains, their reliability and accuracy remain unexamined in many of them, including the religious domain. In this paper, we introduce FiqhQA, a novel benchmark focused on LLM-generated Islamic rulings, explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Unlike prior work, which either overlooks the distinctions between religious schools of thought or fails to evaluate abstention behavior, we assess LLMs not only on their accuracy but also on their ability to recognize when not to answer. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. While GPT-4o outperforms all other models in accuracy, Gemini and Fanar demonstrate superior abstention behavior, which is critical for minimizing confidently incorrect answers. Notably, all models exhibit a performance drop in Arabic, highlighting the limitations of their religious reasoning in languages other than English. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained, school-of-thought-specific Islamic ruling generation and to evaluate abstention for Islamic jurisprudence queries. Our findings underscore the need for task-specific evaluation and cautious deployment of LLMs in religious applications.
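The evaluation described in the abstract combines two metrics: accuracy on answered questions and the abstention rate. A minimal sketch of that scoring, assuming hypothetical field names ("abstained", "answer", "gold") that are illustrative only and not taken from the FiqhQA release:

```python
def score_responses(responses):
    """Score a model's responses on accuracy and abstention.

    responses: list of dicts with keys 'abstained' (bool), and
    'answer' / 'gold' (strings; ignored when the model abstained).
    """
    abstained = sum(1 for r in responses if r["abstained"])
    answered = [r for r in responses if not r["abstained"]]
    correct = sum(1 for r in answered if r["answer"] == r["gold"])
    return {
        # Fraction of all questions the model declined to answer.
        "abstention_rate": abstained / len(responses),
        # Accuracy over answered items only, so abstaining on a hard
        # question is not counted as a confident wrong answer.
        "accuracy": correct / len(answered) if answered else 0.0,
    }

# Toy example with four questions: two correct, one wrong, one abstention.
demo = [
    {"abstained": False, "answer": "permissible", "gold": "permissible"},
    {"abstained": False, "answer": "obligatory", "gold": "permissible"},
    {"abstained": True, "answer": None, "gold": "disliked"},
    {"abstained": False, "answer": "disliked", "gold": "disliked"},
]
print(score_responses(demo))  # abstention_rate 0.25, accuracy 2/3
```

Reporting the two numbers separately is what lets a cautious model such as Gemini or Fanar look strong on abstention even when a bolder model such as GPT-4o wins on raw accuracy.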
Related papers
- IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions [1.3052252174353483]
IslamicLegalBench is the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence. The best model achieves only 68% correctness with 21% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by more than 1%.
arXiv Detail & Related papers (2026-02-02T10:30:59Z) - HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment [52.374772443536045]
HALF (Harm-Aware LLM Fairness) is a framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. We show that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
arXiv Detail & Related papers (2025-10-14T07:13:26Z) - Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms [5.076445852481039]
This study evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam responses from grade 4 students in six Indonesian schools. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading.
arXiv Detail & Related papers (2025-06-05T09:41:09Z) - Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in the reasoning of Large Language Models (LLMs). We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z) - Arabic Dataset for LLM Safeguard Evaluation [62.96160492994489]
This study explores the safety of large language models (LLMs) in Arabic, given its linguistic and cultural complexities. We present an Arab-region-specific safety evaluation dataset of 5,799 questions, covering direct attacks, indirect attacks, and harmless requests containing sensitive words.
arXiv Detail & Related papers (2024-10-22T14:12:43Z) - Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We evaluate widely used models, including the GPT, Claude, Llama, Mistral, and Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z) - To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity [27.10502683001428]
This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities.
Experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts.
arXiv Detail & Related papers (2024-07-24T09:48:48Z) - AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic [0.0]
We introduce AraTrust, the first comprehensive trustworthiness benchmark for Large Language Models (LLMs) in Arabic.
GPT-4 was the most trustworthy LLM, while open-source models, particularly AceGPT 7B and Jais 13B, struggled to achieve a score of 60% in our benchmark.
arXiv Detail & Related papers (2024-03-14T00:45:24Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Building Domain-Specific LLMs Faithful To The Islamic Worldview: Mirage
or Technical Possibility? [0.0]
Large Language Models (LLMs) have demonstrated remarkable performance across numerous natural language understanding use cases.
In the context of Islam and its representation, accurate and factual representation of its beliefs and teachings rooted in the Quran and Sunnah is key.
This work focuses on the challenge of building domain-specific LLMs faithful to the Islamic worldview.
arXiv Detail & Related papers (2023-12-11T18:59:09Z) - QASiNa: Religious Domain Question Answering using Sirah Nabawiyah [0.0]
In Islam, the sources of information, and who may give interpretations (tafseer) of those sources, are strictly regulated.
The approach used by LLM to generate answers based on its own interpretation is similar to the concept of tafseer.
We propose the Question Answering Sirah Nabawiyah (QASiNa) dataset, a novel dataset compiled from Sirah Nabawiyah literature in Indonesian.
arXiv Detail & Related papers (2023-10-12T07:52:19Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal how comprehensively language models understand the questions they are asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.