Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases
- URL: http://arxiv.org/abs/2508.15796v1
- Date: Wed, 13 Aug 2025 10:37:58 GMT
- Title: Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases
- Authors: Nouar AlDahoul, Yasir Zaki
- Abstract summary: The Islamic inheritance domain holds significant importance for Muslims, ensuring fair distribution of shares among heirs. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs to interpret and apply Islamic inheritance laws.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Islamic inheritance domain holds significant importance for Muslims, ensuring fair distribution of shares among heirs. Manual calculation of shares under numerous scenarios is complex, time-consuming, and error-prone. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs to interpret and apply Islamic inheritance laws. We utilized the dataset proposed in the ArabicNLP QIAS 2025 challenge, which includes inheritance case scenarios given in Arabic and derived from Islamic legal sources. Various base and fine-tuned models are assessed on their ability to accurately identify heirs, compute shares, and justify their reasoning in alignment with Islamic legal principles. Our analysis reveals that the proposed majority-voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms all other models that we utilized across every difficulty level. It achieves up to 92.7% accuracy and secures third place overall in Task 1 of the QIAS 2025 challenge.
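The majority-voting approach described in the abstract can be sketched as follows. This is a minimal illustration, assuming each model returns a single multiple-choice letter per question; the tie-breaking rule (falling back to the first model) is an assumption, not taken from the paper.

```python
# Sketch of a majority vote over three model answers to one question.
# Model ordering and tie-breaking are illustrative assumptions.
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; on a full tie, fall back to the first model."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner if count > 1 else answers[0]

# Hypothetical answers from Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3
votes = ["B", "B", "C"]
print(majority_vote(votes))  # -> B
```

With independent errors across the three models, the ensemble answers correctly whenever at least two models agree on the right choice, which is why such voting can beat each individual model.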
Related papers
- When Do Language Models Endorse Limitations on Human Rights Principles? [82.84306700922664]
We evaluate how Large Language Models (LLMs) navigate trade-offs involving the Universal Declaration of Human Rights (UDHR). Our analysis of eleven major LLMs reveals systematic biases where models accept limiting Economic, Social, and Cultural rights more often than Political and Civil rights.
arXiv Detail & Related papers (2026-03-04T16:01:53Z) - IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions [1.3052252174353483]
IslamicLegalBench is the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence. The best model achieves only 68% correctness with 21% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%.
arXiv Detail & Related papers (2026-02-02T10:30:59Z) - DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z) - HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment [52.374772443536045]
HALF (Harm-Aware LLM Fairness) is a framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. We show that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
arXiv Detail & Related papers (2025-10-14T07:13:26Z) - Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation [0.17592522344393483]
o3 and Gemini 2.5 achieved accuracies above 90%, whereas ALLaM, Fanar, LLaMA, and Mistral scored below 50%. We conduct a detailed error analysis to identify recurring failure patterns across models. Our findings highlight limitations in handling structured legal reasoning and suggest directions for improving performance in Islamic legal reasoning.
arXiv Detail & Related papers (2025-09-01T03:08:10Z) - CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning [6.5255476646093316]
Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares. We present a framework for solving inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance and enables fast, on-device inference without generative reasoning.
arXiv Detail & Related papers (2025-08-30T11:03:54Z) - SoK: Large Language Model Copyright Auditing via Fingerprinting [69.14570598973195]
We introduce a unified framework and formal taxonomy that categorizes existing methods into white-box and black-box approaches. We propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios.
arXiv Detail & Related papers (2025-08-27T12:56:57Z) - MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering [13.01152821327721]
This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic), a benchmark to evaluate large language models (LLMs) on Moroccan legal question answering. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps.
arXiv Detail & Related papers (2025-08-22T13:04:43Z) - QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning [1.0152838128195467]
We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation pipeline. Our system achieves an accuracy of 0.858 on the final test, outperforming competitive models such as GPT 4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting.
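The Retrieval-Augmented Generation step above can be sketched as retrieving the most relevant legal passages and prepending them to the question before the fine-tuned model answers. The sketch below uses a toy bag-of-words overlap scorer standing in for the paper's actual (unspecified) retriever; the corpus sentences and prompt layout are illustrative assumptions.

```python
# Toy RAG prompt assembly: rank passages by lexical overlap with the
# question, then prepend the top-k as context. Retriever and corpus
# are illustrative stand-ins, not the paper's actual components.
from collections import Counter

def score(query, passage):
    """Crude lexical overlap between query and passage tokens."""
    q, p = Counter(query.split()), Counter(passage.split())
    return sum((q & p).values())

def build_prompt(question, corpus, k=2):
    """Prepend the k highest-scoring passages to the question."""
    ranked = sorted(corpus, key=lambda p: score(question, p), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

corpus = [
    "A surviving husband receives one quarter when the deceased leaves children.",
    "Grandparents inherit only in the absence of parents.",
    "Zakat is an annual almsgiving obligation.",
]
prompt = build_prompt("What share does the husband receive if there are children?", corpus, k=1)
```

In the paper's pipeline, a learned dense retriever over Islamic legal sources would replace the overlap scorer, and the assembled prompt would be passed to the LoRA-tuned Fanar-1-9B model.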
arXiv Detail & Related papers (2025-08-20T10:29:55Z) - Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks [1.3521447196536418]
This research examines the degree to which state-of-the-art large language models demonstrate and articulate healthcare knowledge in Arabic. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge's MedArabiQ2025 track. Our results reveal significant variations in correct-answer prediction accuracy and low variations in semantic alignment of generated answers.
arXiv Detail & Related papers (2025-08-13T10:41:17Z) - Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions [10.53116395328794]
We introduce FiqhQA, a novel benchmark of LLM-generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained, school-of-thought-specific ruling generation and to evaluate abstention for Islamic queries.
arXiv Detail & Related papers (2025-08-04T07:27:26Z) - LEXam: Benchmarking Legal Reasoning on 340 Law Exams [61.344330783528015]
LEXam is a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions.
arXiv Detail & Related papers (2025-05-19T08:48:12Z) - Can Large Language Models Predict the Outcome of Judicial Decisions? [0.0]
Large Language Models (LLMs) have shown exceptional capabilities in Natural Language Processing (NLP). We benchmark state-of-the-art open-source LLMs, including LLaMA-3.2-3B and LLaMA-3.1-8B, under varying configurations. Our results demonstrate that fine-tuned smaller models achieve comparable performance to larger models in task-specific contexts.
arXiv Detail & Related papers (2025-01-15T11:32:35Z) - MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset [50.36095192314595]
Large Language Models (LLMs) function as conscious agents with generalizable reasoning capabilities. This ability remains underexplored due to the complexity of modeling infinite possible changes in an event. We introduce the first-ever benchmark, MARS, comprising three tasks corresponding to each step.
arXiv Detail & Related papers (2024-06-04T08:35:04Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.