RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2601.04196v1
- Date: Wed, 03 Dec 2025 07:42:49 GMT
- Title: RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
- Authors: Keerthana Murugaraj, Salima Lamsiyah, Martin Theobald
- Abstract summary: RAGVUE is a framework for evaluating Retrieval-Augmented Generation (RAG) systems. It decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. RAGVUE supports both manual metric selection and fully automated agentic evaluation.
- Score: 1.564663326217051
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.
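The claim-level faithfulness idea described in the abstract can be illustrated with a minimal sketch: split the answer into atomic claims, check each claim against the retrieved context, and report the supported fraction. The function names and the token-overlap check below are illustrative assumptions, not RAGVUE's actual API; a real judge would use an LLM to decompose claims and verify entailment, recording a structured explanation per claim.

```python
# Hypothetical sketch of claim-level faithfulness scoring (not RAGVUE's API).

def split_into_claims(answer: str) -> list[str]:
    # Naive claim extraction: one claim per sentence. A real judge would
    # use an LLM to decompose compound sentences into atomic claims.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(claim: str, context: str) -> bool:
    # Placeholder entailment check via token overlap. A real system would
    # prompt an LLM judge and attach a structured explanation per claim.
    claim_tokens = set(claim.lower().split())
    context_tokens = set(context.lower().split())
    overlap = len(claim_tokens & context_tokens) / max(len(claim_tokens), 1)
    return overlap >= 0.5

def faithfulness(answer: str, context: str) -> float:
    # Faithfulness = fraction of claims supported by the retrieved context.
    claims = split_into_claims(answer)
    if not claims:
        return 0.0
    return sum(is_supported(c, context) for c in claims) / len(claims)

context = "Paris is the capital of France. The Seine flows through Paris."
answer = "Paris is the capital of France. Paris has ten million rivers."
print(faithfulness(answer, context))  # 1 of 2 claims supported -> 0.5
```

Reporting the per-claim verdicts alongside the aggregate score is what makes such a metric diagnostic rather than a single opaque number.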
Related papers
- Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries [53.99620546358492]
Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. Existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions. We present the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, and multi-hop queries.
arXiv Detail & Related papers (2025-10-13T21:38:04Z) - MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation [8.950307082012763]
Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs). We present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks.
arXiv Detail & Related papers (2025-04-23T23:05:46Z) - Unanswerability Evaluation for Retrieval Augmented Generation [74.3022365715597]
UAEval4RAG is a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively. We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries.
arXiv Detail & Related papers (2024-12-16T19:11:55Z) - Toward General Instruction-Following Alignment for Retrieval-Augmented Generation [63.611024451010316]
Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems.
We propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems.
arXiv Detail & Related papers (2024-10-12T16:30:51Z) - RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation [61.14660526363607]
We propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules.
RAGChecker has significantly better correlations with human judgments than other evaluation metrics.
The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems.
arXiv Detail & Related papers (2024-08-15T10:20:54Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems [0.0]
Retrieval-Augmented Generation (RAG) has become a standard architectural pattern for incorporating domain-specific knowledge into user-facing chat applications. We introduce RAGBench: the first comprehensive, large-scale RAG benchmark dataset of 100k examples. We formalize the TRACe evaluation framework: a set of explainable and actionable RAG evaluation metrics applicable across all RAG domains.
arXiv Detail & Related papers (2024-06-25T20:23:15Z) - Evaluating Retrieval Quality in Retrieval-Augmented Generation [21.115495457454365]
Traditional end-to-end evaluation methods are computationally expensive.
We propose eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system.
eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
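The per-document evaluation idea behind eRAG can be sketched as follows: each document in the retrieval list is fed to the generator on its own, the correctness of the resulting answer becomes that document's relevance label, and standard ranking metrics follow from those labels. The function names, the exact-match scorer, and the toy generator below are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of per-document retrieval evaluation (eRAG-style).
from typing import Callable

def per_document_labels(
    query: str,
    docs: list[str],
    gold: str,
    generate: Callable[[str, str], str],
) -> list[int]:
    # Run the generator once per document; score each answer by exact match
    # with the gold answer to obtain a binary relevance label per document.
    return [int(generate(query, d).strip() == gold) for d in docs]

def precision_at_k(labels: list[int], k: int) -> float:
    # Standard ranking metric computed from the induced labels.
    top = labels[:k]
    return sum(top) / len(top) if top else 0.0

# Toy generator standing in for an LLM call: "answers" only when the
# document mentions the query's key term.
def toy_generate(query: str, doc: str) -> str:
    return "Paris" if "capital" in doc else "unknown"

docs = ["France's capital is Paris.", "The Alps are mountains."]
labels = per_document_labels("capital of France?", docs, "Paris", toy_generate)
print(labels, precision_at_k(labels, 2))  # [1, 0] 0.5
```

Because each document is processed individually, the per-call context is short, which is where the reported runtime and GPU-memory savings over end-to-end evaluation come from.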
arXiv Detail & Related papers (2024-04-21T21:22:28Z) - CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources.
This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z) - ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems [46.522527144802076]
We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems.
ARES finetunes lightweight LM judges to assess the quality of individual RAG components.
We make our code and datasets publicly available on GitHub.
arXiv Detail & Related papers (2023-11-16T00:39:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.