Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models
- URL: http://arxiv.org/abs/2503.16581v1
- Date: Thu, 20 Mar 2025 13:26:30 GMT
- Title: Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models
- Authors: Zahra Khalila, Arbi Haza Nasution, Winda Monika, Aytug Onan, Yohei Murakami, Yasir Bin Ismail Radi, Noor Mohammad Osmani
- Abstract summary: General-purpose large language models (LLMs) often struggle with hallucinations, where generated responses deviate from authoritative sources. This challenge highlights the need for systems that can integrate domain-specific knowledge while maintaining response accuracy, relevance, and faithfulness. This research utilizes a descriptive dataset of Quranic surahs, including the meanings, historical context, and qualities of the 114 surahs. The models are evaluated using three key metrics scored by human evaluators: context relevance, answer faithfulness, and answer relevance.
- Score: 0.18846515534317265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate and contextually faithful responses are critical when applying large language models (LLMs) to sensitive and domain-specific tasks, such as answering queries related to Quranic studies. General-purpose LLMs often struggle with hallucinations, where generated responses deviate from authoritative sources, raising concerns about their reliability in religious contexts. This challenge highlights the need for systems that can integrate domain-specific knowledge while maintaining response accuracy, relevance, and faithfulness. In this study, we investigate 13 open-source LLMs categorized into large (e.g., Llama3:70b, Gemma2:27b, QwQ:32b), medium (e.g., Gemma2:9b, Llama3:8b), and small (e.g., Llama3.2:3b, Phi3:3.8b) models. A Retrieval-Augmented Generation (RAG) pipeline is used to compensate for the limitations of standalone models. This research utilizes a descriptive dataset of Quranic surahs, including the meanings, historical context, and qualities of the 114 surahs, allowing the model to gather relevant knowledge before responding. The models are evaluated using three key metrics scored by human evaluators: context relevance, answer faithfulness, and answer relevance. The findings reveal that large models consistently outperform smaller models in capturing query semantics and producing accurate, contextually grounded responses. Although classed as small, the Llama3.2:3b model scores highly on faithfulness (4.619) and relevance (4.857), demonstrating the promise of well-optimized smaller architectures. This article examines the trade-offs between model size, computational efficiency, and response quality when applying LLMs in domain-specific applications.
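The retrieve-then-generate flow the abstract describes can be sketched in a few lines. This is a minimal illustration only: the surah descriptions, the bag-of-words cosine retriever, and the prompt template below are hypothetical stand-ins, not the paper's actual dataset, embedding model, or prompts.

```python
# Toy sketch of a RAG step: retrieve the most relevant surah description,
# then build a prompt that grounds the LLM's answer in that context.
import math
from collections import Counter

# Hypothetical placeholder entries; the paper uses descriptions of all 114 surahs.
SURAH_DOCS = {
    "Al-Fatiha": "the opening chapter, a prayer for guidance recited in every unit of prayer",
    "Al-Baqarah": "the longest chapter, covering law, history, and guidance for the community",
    "Al-Ikhlas": "a short surah declaring the oneness of God",
}

def _vec(text: str) -> Counter:
    # Naive whitespace tokenization; real systems would use dense embeddings.
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the names of the k surah descriptions most similar to the query."""
    q = _vec(query)
    ranked = sorted(SURAH_DOCS, key=lambda s: _cosine(q, _vec(SURAH_DOCS[s])), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Prepend retrieved context so the model answers from the dataset, not memory."""
    context = "\n".join(f"{s}: {SURAH_DOCS[s]}" for s in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."
```

Human evaluators then rate the generated answer for context relevance, answer faithfulness, and answer relevance; the grounding prompt is what ties the faithfulness score back to the retrieved passages.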
Related papers
- Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach [9.672512327395435]
Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. This paper presents a large-scale empirical study to investigate the influence of prompt template design on RAG performance. Our findings, based on a test set of 18720 instances, reveal significant performance gains of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It.
arXiv Detail & Related papers (2026-02-14T21:17:44Z) - ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
ReAG is a Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages. ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
arXiv Detail & Related papers (2025-11-27T19:01:02Z) - HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions [50.61510609116118]
HuggingR$^{4}$ is a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to efficiently select models. It attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing existing methods by 26.51% and 33.25% respectively.
arXiv Detail & Related papers (2025-11-24T03:13:45Z) - Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning [49.559151128219725]
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness. We propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets.
arXiv Detail & Related papers (2025-11-13T08:13:23Z) - Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help? [3.920564895363768]
It is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1. Results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash.
arXiv Detail & Related papers (2025-10-25T18:12:41Z) - SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models [79.01078135582127]
SPELL enables scalable, label-free optimization for long-context reasoning. We introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities.
arXiv Detail & Related papers (2025-09-28T13:08:10Z) - GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation [14.494723040096902]
GaRAGe is a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web.
arXiv Detail & Related papers (2025-06-09T11:47:03Z) - Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses.
In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources.
We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z) - Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z) - DRS: Deep Question Reformulation With Structured Output [114.14122339938697]
Large language models (LLMs) can detect unanswerable questions, but struggle to assist users in reformulating these questions. We propose DRS: Deep Question Reformulation with Structured Output, a novel zero-shot method aimed at enhancing LLMs' ability to assist users in reformulating questions. We show that DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while also enhancing the performance of open-source models, such as Gemma2-9B, from 26.35% to 56.75%.
arXiv Detail & Related papers (2024-11-27T02:20:44Z) - Sufficient Context: A New Lens on Retrieval Augmented Generation Systems [19.238772793096473]
Augmenting LLMs with context leads to improved performance across many applications.
We develop a new notion of sufficient context, along with a method to classify instances that have enough information to answer the query.
By stratifying errors based on context sufficiency, we find that larger models with higher baseline performance excel at answering queries when the context is sufficient.
arXiv Detail & Related papers (2024-11-09T02:13:14Z) - Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making [1.3812010983144802]
This study evaluates large language models (LLMs) across diverse domains, including cybersecurity, medicine, and finance.
The results indicate that model size and types of prompts used for inference significantly influenced response length and quality.
arXiv Detail & Related papers (2024-06-25T20:52:31Z) - Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation [8.975024781390077]
We present MIRAGE --Model Internals-based RAG Explanations -- a plug-and-play approach using model internals for faithful answer attribution in question answering applications.
We evaluate our proposed approach on a multilingual QA dataset, finding high agreement with human answer attribution.
arXiv Detail & Related papers (2024-06-19T16:10:26Z) - Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study [61.74571814707054]
We evaluate whether every generated sentence is grounded in retrieved documents or the model's pre-training data.
Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded.
Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations.
arXiv Detail & Related papers (2024-04-10T14:50:10Z) - What Evidence Do Language Models Find Convincing? [94.90663008214918]
We build a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts.
We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions.
Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important.
arXiv Detail & Related papers (2024-02-19T02:15:34Z) - Minimizing Factual Inconsistency and Hallucination in Large Language Models [0.16417409087671928]
Large Language Models (LLMs) are widely used in critical fields such as healthcare, education, and finance.
We propose a multi-stage framework that generates the rationale first, verifies and refines incorrect ones, and uses them as supporting references to generate the answer.
Our framework improves traditional Retrieval Augmented Generation (RAG) by enabling OpenAI GPT-3.5-turbo to be 14-25% more faithful and 16-22% more accurate on two datasets.
arXiv Detail & Related papers (2023-11-23T09:58:39Z) - Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models [54.55088169443828]
Chain-of-Noting (CoN) is a novel approach aimed at improving the robustness of RALMs in facing noisy, irrelevant documents and in handling unknown scenarios.
CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope.
arXiv Detail & Related papers (2023-11-15T18:54:53Z) - Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of RALMs is that retrieved information should help model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z) - Few-Shot Data Synthesis for Open Domain Multi-Hop Question Answering [40.86455734818704]
Few-shot learning for open domain multi-hop question answering typically relies on the in-context learning capability of large language models.
We propose a data synthesis framework for multi-hop question answering that requires less than 10 human annotated question answer pairs.
arXiv Detail & Related papers (2023-05-23T04:57:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.