From RAG to Agentic RAG for Faithful Islamic Question Answering
- URL: http://arxiv.org/abs/2601.07528v1
- Date: Mon, 12 Jan 2026 13:28:28 GMT
- Title: From RAG to Agentic RAG for Faithful Islamic Question Answering
- Authors: Gagan Bhatia, Hamdy Mubarak, Mustafa Jarrar, George Mikros, Fadi Zaraket, Mahmoud Alhirthani, Mutaz Al-Khatib, Logan Cochrane, Kareem Darwish, Rashid Yahiaoui, Firoj Alam
- Abstract summary: We introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers. We also develop an agentic Quran-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: LLMs are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and whether models appropriately abstain when evidence is lacking. To shed light on these failure modes, we introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally develop an end-to-end grounded Islamic modelling suite consisting of (i) 25K Arabic text-grounded SFT reasoning pairs, (ii) 5K bilingual preference samples for reward-guided alignment, and (iii) a verse-level Qur'an retrieval corpus of ~6k atomic verses (ayat). Building on these resources, we develop an agentic Quran-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic-English robustness even with a small model (i.e., Qwen3 4B). We will make the experimental resources and datasets publicly available for the community.
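The agentic loop the abstract describes (structured tool calls for iterative evidence seeking and answer revision, with abstention when evidence is lacking) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the two-verse toy corpus, the keyword-overlap retriever, and the query-reformulation step are all stand-in assumptions.

```python
import re
from dataclasses import dataclass

@dataclass
class Verse:
    ref: str   # e.g. "2:255"
    text: str

# Toy verse-level corpus standing in for the paper's ~6k atomic ayat.
CORPUS = [
    Verse("112:1", "Say, He is Allah, the One."),
    Verse("2:255", "Allah - there is no deity except Him, the Ever-Living."),
]

def _tokens(s: str) -> set[str]:
    return set(re.findall(r"[a-z]+", s.lower()))

def retrieve(query: str, k: int = 2) -> list[Verse]:
    """Keyword-overlap retrieval; a real system would use a BM25 or dense index."""
    q = _tokens(query)
    scored = sorted(((len(q & _tokens(v.text)), v) for v in CORPUS),
                    key=lambda t: t[0], reverse=True)
    return [v for score, v in scored[:k] if score > 0]

def answer_with_tools(question: str, max_rounds: int = 3) -> dict:
    """Iterative evidence seeking: retrieve, ground, reformulate, or abstain."""
    query = question
    for _ in range(max_rounds):
        evidence = retrieve(query)
        if evidence:
            # A real agent would have the LLM draft an answer and revise it
            # against the evidence; here the top verse serves as the grounding.
            return {"answer": evidence[0].text,
                    "citations": [v.ref for v in evidence]}
        # Stand-in for an LLM-issued query-reformulation tool call.
        query = " ".join(w for w in query.split() if len(w) > 3)
    return {"answer": "abstain", "citations": []}
```

The abstain branch is the part that matters for this benchmark: when no round of evidence seeking produces support, the agent declines to answer rather than hallucinate.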
Related papers
- DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z) - FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering [0.0]
We introduce FARSIQA, an end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG.
arXiv Detail & Related papers (2025-10-29T15:25:34Z) - Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content [1.922162958936778]
Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82).
arXiv Detail & Related papers (2025-10-28T14:05:55Z) - Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants [7.228273711234901]
Large Language Models (LLMs) are increasingly used to answer everyday questions. Their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects.
arXiv Detail & Related papers (2025-10-28T11:52:51Z) - AURA Score: A Metric For Holistic Audio Question Answering Evaluation [57.042210272137396]
First, we introduce AQEval to enable systematic benchmarking of AQA metrics. It is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for their correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting weak correlation with human judgment. Third, we propose a new metric, the AURA score, to better evaluate open-ended model responses.
arXiv Detail & Related papers (2025-10-06T15:41:34Z) - HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark [54.73504952691398]
We set out to deliver a Hebrew machine reading dataset framed as extractive question answering. The morphologically rich nature of Hebrew poses a challenge to this endeavor. We devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics.
arXiv Detail & Related papers (2025-08-03T15:53:01Z) - Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models [0.18846515534317265]
General-purpose large language models (LLMs) often struggle with hallucinations, where generated responses deviate from authoritative sources. This challenge highlights the need for systems that can integrate domain-specific knowledge while maintaining response accuracy, relevance, and faithfulness. This research utilizes a descriptive dataset of Quranic surahs including the meanings, historical context, and qualities of the 114 surahs. The models are evaluated using three key metrics set by human evaluators: context relevance, answer faithfulness, and answer relevance.
arXiv Detail & Related papers (2025-03-20T13:26:30Z) - Cross-Language Approach for Quranic QA [1.0124625066746595]
The Quranic QA system holds significant importance as it facilitates a deeper understanding of the Quran, a holy text for over a billion people worldwide. These systems face unique challenges, including the linguistic disparity between questions written in Modern Standard Arabic and answers found in Quranic verses written in Classical Arabic. We adopt a cross-language approach by expanding and enriching the dataset through machine translation to convert Arabic questions into English, paraphrasing questions to create linguistic diversity, and retrieving answers from an English translation of the Quran to align with multilingual training requirements.
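The cross-language steps this summary lists (translate the Arabic question to English, paraphrase it for linguistic diversity, then retrieve from an English Quran translation) could be sketched roughly as below. Everything here is a hypothetical stand-in: the translation lookup, the trivial paraphraser, and the two-verse corpus are illustrative assumptions, not the paper's pipeline.

```python
# Illustrative cross-language QA sketch: Arabic question in, English verse out.

# Tiny stand-in for an English translation of the Quran, keyed by verse ref.
ENGLISH_QURAN = {
    "112:1": "Say, He is Allah, the One.",
    "1:2": "All praise is due to Allah, Lord of the worlds.",
}

def translate_ar_to_en(question_ar: str) -> str:
    # Stand-in for a machine-translation call; falls back to the input.
    lookup = {"من هو الله الأحد": "Who is Allah the One"}
    return lookup.get(question_ar, question_ar)

def paraphrase(question_en: str) -> list[str]:
    # Stand-in for an LLM paraphraser; returns trivial variants here.
    return [question_en, question_en.lower()]

def best_verse(question_ar: str) -> str:
    """Return the verse ref whose English text best overlaps the translated question."""
    variants = paraphrase(translate_ar_to_en(question_ar))
    q_words = {w.strip(".,").lower() for v in variants for w in v.split()}
    def score(text: str) -> int:
        return len(q_words & {w.strip(".,").lower() for w in text.split()})
    return max(ENGLISH_QURAN, key=lambda ref: score(ENGLISH_QURAN[ref]))
```

Word-overlap scoring is only a placeholder for the retrieval model; the point is the translate-then-paraphrase-then-retrieve shape of the pipeline.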
arXiv Detail & Related papers (2025-01-29T07:13:27Z) - DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs [70.54226917774933]
We propose the DecompositionAlignment-Reasoning Agent (DARA) framework.
DARA effectively parses questions into formal queries through a dual mechanism.
We show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA.
arXiv Detail & Related papers (2024-06-11T09:09:37Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - TASA: Deceiving Question Answering Models by Twin Answer Sentences Attack [93.50174324435321]
We present Twin Answer Sentences Attack (TASA), an adversarial attack method for question answering (QA) models.
TASA produces fluent and grammatical adversarial contexts while maintaining gold answers.
arXiv Detail & Related papers (2022-10-27T07:16:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.