Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content
- URL: http://arxiv.org/abs/2510.24438v1
- Date: Tue, 28 Oct 2025 14:05:55 GMT
- Title: Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content
- Authors: Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, Junaid Qadir
- Abstract summary: Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.
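The dual-agent setup described in the abstract can be sketched in miniature: a quantitative agent assigns per-dimension 1-5 scores that are averaged into a mean score, and a qualitative agent's side-by-side judgments are tallied as pairwise wins. The dimension names and helper functions below are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of aggregating dual-agent evaluation outputs.
# Dimension names are assumptions based on the abstract's examples.
QUANT_DIMENSIONS = ["Structure", "Islamic Consistency", "Citations",
                    "Islamic Accuracy", "Clarity", "Completeness"]

def mean_quant_score(scores: dict) -> float:
    """Average a model's 1-5 scores over the quantitative dimensions."""
    return sum(scores[d] for d in QUANT_DIMENSIONS) / len(QUANT_DIMENSIONS)

def pairwise_wins(judgments: list, model: str) -> int:
    """Count side-by-side comparisons won by `model` (qualitative agent output)."""
    return sum(1 for winner in judgments if winner == model)
```

Under this framing, a headline number like "3.90/5 mean quantitative score" is just `mean_quant_score` over the rubric, and "116/200 pairwise wins" is `pairwise_wins` over 200 judged comparisons.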
Related papers
- IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions [1.3052252174353483]
IslamicLegalBench is the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence. The best model achieves only 68% correctness with 21% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by more than 1%.
arXiv Detail & Related papers (2026-02-02T10:30:59Z)
- From RAG to Agentic RAG for Faithful Islamic Question Answering [12.67590523116037]
We introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers. We also develop an agentic Quran-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision.
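The agentic RAG idea above, structured tool calls for iterative evidence seeking and answer revision, can be hedged as a minimal loop. The `retrieve` and `generate` callables below are hypothetical stand-ins for the paper's actual grounding tools:

```python
def agentic_rag_answer(question, retrieve, generate, max_rounds=3):
    """Iteratively gather evidence and revise the answer until it stabilizes.

    Illustrative loop only: `retrieve(question, answer)` stands in for a
    structured tool call that fetches grounding passages, and
    `generate(question, evidence)` for the LLM's evidence-conditioned draft.
    """
    evidence = []
    answer = None
    for _ in range(max_rounds):
        evidence += retrieve(question, answer)
        new_answer = generate(question, evidence)
        if new_answer == answer:  # converged: revision produced no change
            break
        answer = new_answer
    return answer
```

The key design point is that retrieval is conditioned on the current draft, so each round can seek evidence targeted at what the previous answer got wrong.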
arXiv Detail & Related papers (2026-01-12T13:28:28Z)
- DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z)
- Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People [81.63702981397408]
Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling).
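Expected information gain (EIG), the metric cited in bits above, is the expected reduction in entropy over hypotheses from asking a question. A minimal sketch under standard Bayesian assumptions (the paper's exact formulation may differ):

```python
import math

def entropy(p):
    """Shannon entropy of a probability vector, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_information_gain(prior, likelihoods):
    """EIG = H(prior) - E_outcome[H(posterior)].

    `likelihoods[o][h]` is an assumed P(outcome o | hypothesis h);
    posteriors are obtained by Bayes' rule for each possible outcome.
    """
    eig = entropy(prior)
    for lik in likelihoods:
        p_o = sum(l * p for l, p in zip(lik, prior))  # marginal of outcome o
        if p_o > 0:
            posterior = [l * p / p_o for l, p in zip(lik, prior)]
            eig -= p_o * entropy(posterior)
    return eig
```

A perfectly discriminating question over two equally likely hypotheses yields 1 bit; an uninformative one yields 0, which is the scale on which a 0.227-bit improvement should be read.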
arXiv Detail & Related papers (2025-10-23T17:57:28Z)
- Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale [51.41777906371754]
We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. A lightweight language model, LFM2-1.2B, is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths.
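Slerp merging, mentioned above, interpolates along the great circle between two weight vectors rather than linearly, preserving their norm geometry. An illustrative pure-Python version on flat vectors (real model merges apply this per tensor; everything here is a sketch, not Hala's implementation):

```python
import math

def slerp(w_a, w_b, t):
    """Spherical linear interpolation between two weight vectors at fraction t."""
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    dot = sum(a * b for a, b in zip(w_a, w_b)) / (norm(w_a) * norm(w_b))
    dot = max(-1.0, min(1.0, dot))          # guard against rounding drift
    theta = math.acos(dot)                  # angle between the two vectors
    if theta < 1e-6:                        # nearly parallel: fall back to lerp
        return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    s = math.sin(theta)
    f_a = math.sin((1 - t) * theta) / s
    f_b = math.sin(t * theta) / s
    return [f_a * a + f_b * b for a, b in zip(w_a, w_b)]
```

Compared with a plain weighted average, slerp keeps the interpolated weights on the sphere spanned by the two endpoints, which is the property merge recipes rely on to balance a specialized model against its base.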
arXiv Detail & Related papers (2025-09-17T14:19:28Z)
- UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat [1.2788586581322734]
The Saudi Data and AI Authority introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was adopted by HUMAIN, who developed and deployed HUMAIN Chat. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B.
arXiv Detail & Related papers (2025-08-24T14:32:15Z)
- QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning [1.0152838128195467]
We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation pipeline. Our system achieves an accuracy of 0.858 on the final test set, outperforming competitive models such as GPT-4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting.
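Low-Rank Adaptation (LoRA), the fine-tuning method above, freezes the base weight matrix W and learns a low-rank update BA, so the effective forward pass is (W + alpha * BA) x. A toy pure-Python sketch of that forward pass (shapes, scaling, and naming are illustrative, not Fanar's configuration):

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass with a LoRA update: y = W x + alpha * B (A x).

    W is the frozen (d_out x d_in) base weight; A (r x d_in) and B
    (d_out x r) form the trainable rank-r update. Computing B(Ax)
    instead of (BA)x avoids materializing the full-size delta matrix.
    """
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)                    # frozen path
    delta = matvec(B, matvec(A, x))        # low-rank trainable path
    return [b + alpha * d for b, d in zip(base, delta)]
```

With rank r much smaller than the layer dimensions, only A and B are trained, which is what makes adapting a 9B-parameter model tractable on modest hardware.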
arXiv Detail & Related papers (2025-08-20T10:29:55Z)
- Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions [10.53116395328794]
We introduce FiqhQA, a novel benchmark of LLM-generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained, school-of-thought-specific ruling generation and to evaluate abstention for Islamic queries.
arXiv Detail & Related papers (2025-08-04T07:27:26Z)
- AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic [0.0]
We introduce AraTrust, the first comprehensive trustworthiness benchmark for Large Language Models (LLMs) in Arabic.
GPT-4 was the most trustworthy LLM, while open-source models, particularly AceGPT 7B and Jais 13B, struggled to achieve a score of 60% in our benchmark.
arXiv Detail & Related papers (2024-03-14T00:45:24Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- QASiNa: Religious Domain Question Answering using Sirah Nabawiyah [0.0]
In Islam, the sources of information, and who may give interpretations (tafseer) of those sources, are strictly regulated. The approach an LLM uses to generate answers based on its own interpretation is similar to the concept of tafseer. We propose the Question Answering Sirah Nabawiyah (QASiNa) dataset, a novel dataset compiled from Sirah Nabawiyah literature in the Indonesian language.
arXiv Detail & Related papers (2023-10-12T07:52:19Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.