IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions
- URL: http://arxiv.org/abs/2602.21226v1
- Date: Mon, 02 Feb 2026 10:30:59 GMT
- Title: IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions
- Authors: Ezieddin Elmahjub, Junaid Qadir, Abdullah Mushtaq, Rafay Naeem, Ibrahim Ghaznavi, Waleed Iqbal
- Abstract summary: IslamicLegalBench is the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence. The best model achieves only 68% correctness with 21% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%. These results highlight that prompt-based methods cannot compensate for missing foundational knowledge. IslamicLegalBench offers the first systematic framework to evaluate Islamic legal reasoning in AI, revealing critical gaps in tools increasingly relied on for spiritual guidance.
Related papers
- Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing [63.96040994220329]
We find that SFT-based methods, e.g., Reason-KE, suffer from a "faithfulness gap". This gap enables the LLM's powerful parametric priors to override new contextual facts. We propose Reason-KE++, an SFT+RL framework that instills process-level faithfulness.
arXiv Detail & Related papers (2025-11-16T15:49:01Z) - Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content [1.922162958936778]
Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82).
arXiv Detail & Related papers (2025-10-28T14:05:55Z) - Robust Knowledge Editing via Explicit Reasoning Chains for Distractor-Resilient Multi-Hop QA [63.96040994220329]
Reason-KE steers a pretrained large language model through four structured stages (fact acknowledgment, relevance determination, selective application, and final reasoning) to filter distractors in a single pass. Trained on MQuAKE-CF with up to four irrelevant facts, Reason-KE elevates QA accuracy to 90.2% while suffering merely a 6.3% drop under heavy distraction and 1% when answers are leaked.
arXiv Detail & Related papers (2025-09-01T13:37:42Z) - Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation [0.17592522344393483]
o3 and Gemini 2.5 achieved accuracies above 90%, whereas ALLaM, Fanar, LLaMA, and Mistral scored below 50%. We conduct a detailed error analysis to identify recurring failure patterns across models. Our findings highlight limitations in handling structured legal reasoning and suggest directions for improving performance in Islamic legal reasoning.
arXiv Detail & Related papers (2025-09-01T03:08:10Z) - QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning [1.0152838128195467]
We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline. Our system achieves an accuracy of 0.858 on the final test, outperforming competitive models such as GPT-4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting.
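The retrieval half of a pipeline like the one above can be sketched in a few lines. The token-overlap retriever, function names, and prompt layout below are simplifying assumptions for illustration; the actual system pairs a proper retriever with the LoRA-adapted Fanar-1-9B model, neither of which is shown here.

```python
from collections import Counter

def overlap_score(query: str, passage: str) -> int:
    """Count shared tokens between query and passage (a crude relevance proxy)."""
    q, p = Counter(query.lower().split()), Counter(passage.lower().split())
    return sum((q & p).values())

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k passages with the highest token overlap with the query."""
    return sorted(corpus, key=lambda p: overlap_score(query, p), reverse=True)[:k]

def build_rag_prompt(query: str, corpus: list[str], k: int = 3) -> str:
    """Assemble the retrieved context plus the question into one prompt."""
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The assembled prompt would then be passed to the fine-tuned model; in practice the bag-of-words scorer would be replaced by a dense or BM25 retriever over a corpus of inheritance-law texts.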
arXiv Detail & Related papers (2025-08-20T10:29:55Z) - Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases [1.3521447196536418]
The Islamic inheritance domain holds significant importance for Muslims to ensure fair distribution of shares among heirs. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs to interpret and apply Islamic inheritance laws.
arXiv Detail & Related papers (2025-08-13T10:37:58Z) - Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions [10.53116395328794]
We introduce a novel benchmark, FiqhQA, focused on LLM-generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained, school-of-thought-specific ruling generation and to evaluate abstention for Islamic queries.
arXiv Detail & Related papers (2025-08-04T07:27:26Z) - Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. We show that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z) - Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue.
arXiv Detail & Related papers (2024-11-05T01:11:28Z) - Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z) - Flames: Benchmarking Value Alignment of LLMs in Chinese [86.73527292670308]
This paper proposes a value alignment benchmark named Flames.
It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values.
Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z) - Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning [105.77733287326308]
We evaluate 10 recent open-source LMMs from the 3B to 80B parameter scale on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following.
We explore training-free in-context learning (ICL) as a solution and study how it affects these limitations.
Based on our ICL study, we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
arXiv Detail & Related papers (2023-10-01T12:02:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.