RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG
- URL: http://arxiv.org/abs/2511.04502v1
- Date: Thu, 06 Nov 2025 16:22:52 GMT
- Title: RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG
- Authors: Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere
- Abstract summary: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence. Existing evaluation frameworks often rely on metrics that fail to capture domain-specific nuances. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works use LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics, Answer Correctness and Answerability, using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent: no single embedding model, LLM, or hyperparameter configuration proves universally optimal. We also analyze the most common causes of low Answer Correctness in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our GitHub.
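To make the LLM-as-a-Judge setup concrete, here is a minimal sketch of an Answer Correctness scorer in the style the abstract describes. The prompt wording, 0-to-1 scale, and `call_llm` interface are illustrative assumptions, not RAGalyst's actual prompt-optimized judges, which live in its repository.

```python
from typing import Callable

# Minimal sketch of an LLM-as-a-Judge "Answer Correctness" metric.
# Prompt text and score scale are assumptions for illustration.
JUDGE_PROMPT = """\
You are grading a RAG system's answer against a ground-truth answer.
Question: {question}
Ground-truth answer: {reference}
Candidate answer: {candidate}
Return only a correctness score between 0.0 and 1.0."""

def answer_correctness(
    question: str,
    reference: str,
    candidate: str,
    call_llm: Callable[[str], str],  # any text-in/text-out LLM client
) -> float:
    """Score one QA pair; clamp or discard malformed judge output."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    try:
        return min(max(float(raw.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0  # treat unparseable verdicts as incorrect
```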
Related papers
- Domain-Specific Data Generation Framework for RAG Adaptation [58.20906914537952]
Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models with external retrieval to enable domain-grounded responses. We propose RAGen, a framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches.
arXiv Detail & Related papers (2025-10-13T09:59:49Z)
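The QAC triples RAGen emits can be pictured as simple records. The field names in this Python sketch are assumptions for illustration, not RAGen's actual schema:

```python
from dataclasses import dataclass

# Illustrative shape of a question-answer-context (QAC) triple as
# described in the RAGen abstract; field names are assumed.
@dataclass(frozen=True)
class QACTriple:
    question: str    # synthetic question grounded in a source passage
    answer: str      # reference answer derived from the same passage
    context: str     # the passage the pair was generated from
    source_doc: str  # provenance, useful when auditing generated data
```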
- Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework [2.102846336724103]
Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses. This work introduces a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritizes semantic diversity and privacy preservation.
arXiv Detail & Related papers (2025-08-26T11:16:14Z)
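One common way to enforce the semantic diversity this framework prioritizes is to greedily reject near-duplicate questions by embedding similarity. The sketch below is an assumed illustration of that idea, not the paper's actual mechanism:

```python
import numpy as np

def keep_diverse(embs: np.ndarray, max_cos: float = 0.9) -> list[int]:
    """Greedy diversity filter over question embeddings (rows of embs):
    keep a candidate only if its cosine similarity to every
    already-kept question stays below max_cos (an assumed threshold)."""
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(unit)):
        if all(float(unit[i] @ unit[j]) < max_cos for j in kept):
            kept.append(i)
    return kept  # indices of questions that pass the diversity check
```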
- Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation [52.3707788779464]
We introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD). ARC-JSD enables efficient and accurate identification of essential context sentences without additional fine-tuning, gradient calculation, or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, HotpotQA, and MuSiQue, using instruction-tuned LLMs of different scales demonstrate superior accuracy and significant computational efficiency improvements.
arXiv Detail & Related papers (2025-05-22T09:04:03Z)
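The core quantity behind ARC-JSD is the Jensen-Shannon divergence between the model's answer distribution with the full context and with a candidate sentence removed: the larger the divergence its removal induces, the more that sentence matters. A minimal numpy sketch, with the leave-one-out ablation loop assumed:

```python
import numpy as np

def jsd(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two discrete distributions
    (e.g., answer-token probabilities with / without a context sentence)."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.sum(a * np.log((a + eps) / (b + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Assumed ARC-JSD-style usage: score each context sentence by the
# divergence between the full-context distribution and the distribution
# obtained after ablating that sentence.
def attribute(full_dist: np.ndarray, ablated: list[np.ndarray]) -> list[float]:
    return [jsd(full_dist, d) for d in ablated]
```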
- Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs), yet RAG pipelines often fail to ensure that model reasoning remains consistent with the retrieved evidence, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA), and introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
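The critique-refine loop at the heart of CDA can be sketched as follows. The two LLM roles, the prompts, and the APPROVED stop token are illustrative assumptions; the early exit mirrors AlignRAG-auto's dynamic termination rather than a fixed iteration budget:

```python
from typing import Callable

def critique_refine(
    question: str,
    evidence: str,
    generate: Callable[[str], str],  # drafts and revises an answer
    critique: Callable[[str], str],  # checks the answer against evidence
    max_iters: int = 3,
) -> str:
    """Sketch of a critique-driven alignment loop (assumed prompts)."""
    answer = generate(
        f"Answer using only this evidence.\n{evidence}\nQ: {question}")
    for _ in range(max_iters):
        verdict = critique(
            f"Evidence:\n{evidence}\nAnswer:\n{answer}\n"
            "Reply APPROVED if fully supported, else explain the mismatch.")
        if verdict.strip().startswith("APPROVED"):
            break  # autonomous termination, in the spirit of AlignRAG-auto
        answer = generate(
            "Revise the answer so it follows the evidence.\n"
            f"Evidence:\n{evidence}\nAnswer:\n{answer}\nCritique:\n{verdict}")
    return answer
```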
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce OmniEval, an omnidirectional and automatic RAG benchmark in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
- Unanswerability Evaluation for Retrieval Augmented Generation [74.3022365715597]
UAEval4RAG is a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively. We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries.
arXiv Detail & Related papers (2024-12-16T19:11:55Z)
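A framework like this ultimately reports how often a system correctly declines queries it cannot answer. The toy scorer below illustrates that metric with a naive keyword-based refusal detector, which is an assumption for the sketch, not UAEval4RAG's method:

```python
# Assumed refusal markers; real frameworks typically use an LLM judge
# rather than keyword matching to detect abstention.
REFUSAL_MARKERS = ("cannot answer", "not enough information", "no relevant")

def correct_refusal_rate(responses: list[str]) -> float:
    """Fraction of responses to deliberately unanswerable queries in
    which the RAG system correctly declines to answer."""
    refused = sum(
        any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return refused / len(responses) if responses else 0.0
```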
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
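The three metrics can be carried around as a simple per-example record, as in the sketch below; the 0-1 scale and the directionality noted in the comments are assumptions for illustration:

```python
from dataclasses import dataclass

# Illustrative container for RAGEval's three reported metrics;
# field semantics and scale are assumed, not the paper's definitions.
@dataclass
class RAGEvalScores:
    completeness: float   # coverage of the reference content (higher is better)
    hallucination: float  # unsupported content (lower is better)
    irrelevance: float    # off-topic content (lower is better)
```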
- RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems [0.0]
Retrieval-Augmented Generation (RAG) has become a standard architectural pattern for integrating domain-specific knowledge into user-facing chat applications. We introduce RAGBench: the first comprehensive, large-scale RAG benchmark dataset of 100k examples. We formalize the TRACe evaluation framework: a set of explainable and actionable RAG evaluation metrics applicable across all RAG domains.
arXiv Detail & Related papers (2024-06-25T20:23:15Z)
- Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework [0.5897092980823265]
We propose a comprehensive framework to evaluate Retrieval-Augmented Generation (RAG) Question-Answering systems. We use Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required.
arXiv Detail & Related papers (2024-06-20T23:20:34Z)
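Elo-based ranking, as the name RAGElo suggests, reduces to the standard pairwise rating update, with a judge LLM's preference between two systems' answers playing the role of a game result. A minimal sketch (the K-factor default is an assumption):

```python
def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update for a pairwise comparison between RAG variants A
    and B. score_a is 1.0 if the judge prefers A's answer, 0.0 if it
    prefers B's, and 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))  # zero-sum update
    return r_a, r_b
```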