QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning
- URL: http://arxiv.org/abs/2508.15854v1
- Date: Wed, 20 Aug 2025 10:29:55 GMT
- Title: QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning
- Authors: Mohammad AL-Smadi
- Abstract summary: We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline. Our system achieved an accuracy of 0.858 on the final test, outperforming competitive models such as GPT-4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents our approach and results for SubTask 1: Islamic Inheritance Reasoning at QIAS 2025, a shared task focused on evaluating Large Language Models (LLMs) in understanding and reasoning within Islamic inheritance knowledge. We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline. Our system addresses the complexities of Islamic inheritance law, including comprehending inheritance scenarios, identifying eligible heirs, applying fixed-share rules, and performing precise calculations. Our system achieved an accuracy of 0.858 on the final test, outperforming other competitive models such as GPT-4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting. Our results demonstrate that QU-NLP achieves near state-of-the-art accuracy (85.8%), excelling especially on advanced reasoning (97.6%), where it outperforms Gemini 2.5 and OpenAI's o3. This highlights that domain-specific fine-tuning combined with retrieval grounding enables mid-scale Arabic LLMs to surpass frontier models in Islamic inheritance reasoning.
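The LoRA fine-tuning described in the abstract freezes the base weights and trains only a pair of low-rank matrices per adapted layer. The following is a minimal numeric sketch in plain Python with toy sizes, not the paper's implementation: the real adapters would attach to Fanar-1-9B's projection matrices, and `d` and `r` here are illustrative values only.

```python
# Toy sketch of a Low-Rank Adaptation (LoRA) update: the frozen weight W
# is adapted by adding a rank-r product B @ A, and only B and A are trained.
import random

def matmul(X, Y):
    # Plain-Python matrix product: (m x k) @ (k x n) -> (m x n).
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r = 8, 2                      # toy hidden size and LoRA rank (r << d)
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen d x d
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # trainable d x r
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trainable r x d

delta = matmul(B, A)             # rank-r update, d x d
W_adapted = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]

full_params = d * d              # parameters to train without LoRA
lora_params = 2 * d * r          # parameters actually trained with LoRA
print(full_params, lora_params)  # prints: 64 32
```

The same arithmetic is what keeps the trainable parameter count a small fraction of the full model at 9B scale, which is why LoRA makes domain-specific fine-tuning of a mid-scale model affordable.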
Related papers
- IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions [1.3052252174353483]
IslamicLegalBench is the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence. The best model achieves only 68% correctness with a 21% hallucination rate. Few-shot prompting provides minimal gains, improving only 2 of 9 models by more than 1%.
arXiv Detail & Related papers (2026-02-02T10:30:59Z) - RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z) - FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering [0.0]
We introduce FARSIQA, an end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG.
arXiv Detail & Related papers (2025-10-29T15:25:34Z) - Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation [0.17592522344393483]
o3 and Gemini 2.5 achieved accuracies above 90%, whereas ALLaM, Fanar, LLaMA, and Mistral scored below 50%. We conduct a detailed error analysis to identify recurring failure patterns across models. Our findings highlight limitations in handling structured legal reasoning and suggest directions for improving performance in Islamic legal reasoning.
arXiv Detail & Related papers (2025-09-01T03:08:10Z) - CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning [6.5255476646093316]
Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares. We present a framework for solving inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance, and enables fast, on-device inference without generative reasoning.
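The encoder-plus-ranking idea above can be illustrated with a toy sketch: embed the question and each answer option, score each option by similarity, and return the options in relevance order. The bag-of-words cosine "encoder" below is a hypothetical stand-in for the paper's specialised Arabic text encoder and its attentive scoring head, and the example question and options are invented for illustration.

```python
# Toy sketch of ranking answer options by semantic relevance, in the spirit
# of encoder-based scoring: no generative reasoning, just embed-and-score.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Hypothetical encoder: a simple bag-of-words term-count vector.
    return Counter(re.findall(r"\w+", text.lower()))

def relevance(question: str, option: str) -> float:
    # Cosine similarity between question and option embeddings.
    q, o = embed(question), embed(option)
    dot = sum(q[t] * o[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    no = math.sqrt(sum(v * v for v in o.values()))
    return dot / (nq * no) if nq and no else 0.0

def rank_options(question: str, options: list[str]) -> list[str]:
    # Score every option against the question; highest relevance first.
    return sorted(options, key=lambda o: relevance(question, o), reverse=True)

question = "What share does the wife inherit when the deceased leaves children?"
options = [
    "The wife inherits one quarter of the estate.",
    "The wife inherits one eighth when the deceased leaves children.",
    "The estate passes entirely to the state treasury.",
]
best = rank_options(question, options)[0]
```

Because scoring is a single encoder pass per option rather than autoregressive generation, this shape of system is what makes the fast, on-device inference claimed above plausible.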
arXiv Detail & Related papers (2025-08-30T11:03:54Z) - Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases [1.3521447196536418]
The Islamic inheritance domain holds significant importance for Muslims, ensuring the fair distribution of shares among heirs. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs in interpreting and applying Islamic inheritance laws.
arXiv Detail & Related papers (2025-08-13T10:37:58Z) - MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization [74.04867639197445]
MiroMind-M1 is a set of fully open-source RLMs built on the Qwen-2.5 backbone. Our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems.
arXiv Detail & Related papers (2025-07-19T16:21:23Z) - Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions [1.1883838320818292]
The use of large language models (LLMs) in hiring promises to streamline candidate screening, but it also raises serious concerns regarding accuracy and algorithmic bias. We benchmark several state-of-the-art foundational LLMs and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. Our experiments show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups.
arXiv Detail & Related papers (2025-07-02T19:02:18Z) - RLPR: Extrapolating RLVR to General Domains without Verifiers [103.14103272635893]
We propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. We find that addressing the high variance of this noisy probability reward is crucial to making it work. RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models.
arXiv Detail & Related papers (2025-06-23T02:56:36Z) - SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [118.8024915014751]
Large language models (LLMs) have demonstrated remarkable proficiency in academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. We present SuperGPQA, a benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.
arXiv Detail & Related papers (2025-02-20T17:05:58Z) - Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - Flames: Benchmarking Value Alignment of LLMs in Chinese [86.73527292670308]
This paper proposes a value alignment benchmark named Flames.
It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values.
Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z)