Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey
- URL: http://arxiv.org/abs/2504.14891v1
- Date: Mon, 21 Apr 2025 06:39:47 GMT
- Title: Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey
- Authors: Aoran Gan, Hao Yu, Kai Zhang, Qi Liu, Wenyu Yan, Zhenya Huang, Shiwei Tong, Guoping Hu
- Abstract summary: Retrieval-Augmented Generation (RAG) has revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval. Evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and generation components.
- Score: 29.186229489968564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and generation components, as well as their dependence on dynamic knowledge sources in the LLM era. In response, this paper provides a comprehensive survey of RAG evaluation methods and frameworks, systematically reviewing traditional and emerging evaluation approaches for system performance, factual accuracy, safety, and computational efficiency in the LLM era. We also compile and categorize the RAG-specific datasets and evaluation frameworks, conducting a meta-analysis of evaluation practices in high-impact RAG research. To the best of our knowledge, this work represents the most comprehensive survey of RAG evaluation, bridging traditional and LLM-driven methods, and serves as a critical resource for advancing RAG development.
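To make the retrieval-versus-generation split concrete, below is a minimal sketch of an evaluation harness in the spirit the abstract describes: retrieval quality and generation quality are scored separately, then aggregated. The `retriever`, `generator`, and metric choices are hypothetical stand-ins for illustration, not the survey's prescribed setup.

```python
# Minimal sketch of a two-stage RAG evaluation harness (hypothetical example):
# retrieval and generation are scored separately, mirroring the hybrid
# architecture that makes RAG evaluation challenging.

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant documents found in the top-k retrieved results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def token_overlap(answer, reference):
    """Crude generation-quality proxy: unigram overlap with a reference answer."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / max(len(r), 1)

def evaluate_rag(samples, retriever, generator, k=5):
    """Each sample: {'query', 'relevant_ids', 'reference'}. Returns mean scores."""
    retrieval_scores, generation_scores = [], []
    for s in samples:
        docs = retriever(s["query"])  # assumed to return (doc_id, text) pairs
        retrieval_scores.append(
            recall_at_k([d for d, _ in docs], s["relevant_ids"], k))
        answer = generator(s["query"], [t for _, t in docs[:k]])
        generation_scores.append(token_overlap(answer, s["reference"]))
    n = max(len(samples), 1)
    return {"recall@k": sum(retrieval_scores) / n,
            "overlap": sum(generation_scores) / n}
```

In a real benchmark the overlap proxy would be replaced by stronger judges (e.g., LLM-based faithfulness scoring), but the separation of the two score streams is the structural point.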
Related papers
- A Survey on Knowledge-Oriented Retrieval-Augmented Generation [45.65542434522205]
Retrieval-Augmented Generation (RAG) has gained significant attention in recent years.
RAG combines large-scale retrieval systems with generative models.
We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge.
arXiv Detail & Related papers (2025-03-11T01:59:35Z)
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.
Our benchmark is characterized by its multi-dimensional evaluation framework.
Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
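One plausible way to operationalize keypoint-style metrics such as Completeness, Hallucination, and Irrelevance is sketched below; RAGEval's exact definitions may differ, and `classify_keypoint` is a hypothetical stand-in for an LLM judge that labels each ground-truth keypoint against the generated answer.

```python
# Hedged illustration of keypoint-based metrics in the spirit of RAGEval's
# Completeness / Hallucination / Irrelevance (the paper's exact formulas may
# differ). classify_keypoint(answer, keypoint) is an assumed LLM judge that
# returns 'covered', 'contradicted', or 'missing'.

from collections import Counter

def keypoint_metrics(answer, keypoints, classify_keypoint):
    labels = Counter(classify_keypoint(answer, kp) for kp in keypoints)
    n = max(len(keypoints), 1)
    return {
        "completeness": labels["covered"] / n,       # keypoints stated correctly
        "hallucination": labels["contradicted"] / n, # keypoints the answer gets wrong
        "irrelevance": labels["missing"] / n,        # keypoints never addressed
    }
```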
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- BERGEN: A Benchmarking Library for Retrieval-Augmented Generation [26.158785168036662]
Retrieval-Augmented Generation enhances Large Language Models with external knowledge.
Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline.
In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments.
arXiv Detail & Related papers (2024-07-01T09:09:27Z)
- FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
Retrieval-augmented generation (RAG) has attracted considerable research attention.
Existing RAG toolkits are often heavy and inflexible, failing to meet the customization needs of researchers.
Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z)
- Evaluation of Retrieval-Augmented Generation: A Survey [13.633909177683462]
We provide a comprehensive overview of the evaluation and benchmarks of Retrieval-Augmented Generation (RAG) systems.
Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness.
We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions to advance the field of RAG benchmarks.
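Of the quantifiable generation metrics named above, faithfulness is the most RAG-specific. The sketch below scores it as the fraction of answer sentences supported by the retrieved context, with a naive lexical-overlap test standing in for the NLI model or LLM judge a real benchmark would use.

```python
# Sketch of a faithfulness score: the fraction of answer sentences supported
# by the retrieved context. is_supported is a deliberately naive placeholder
# for an entailment model or LLM judge, kept lexical so the example runs
# self-contained.

import re

def is_supported(sentence, context, threshold=0.5):
    """Naive support test: enough of the sentence's words appear in the context."""
    words = set(re.findall(r"\w+", sentence.lower()))
    ctx = set(re.findall(r"\w+", context.lower()))
    return len(words & ctx) / max(len(words), 1) >= threshold

def faithfulness(answer, context):
    """Share of answer sentences grounded in the retrieved context."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    return sum(is_supported(s, context) for s in sentences) / len(sentences)
```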
arXiv Detail & Related papers (2024-05-13T02:33:25Z)
- A Survey on Retrieval-Augmented Text Generation for Large Language Models [1.4579344926652844]
Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements.
This paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation.
It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies.
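The four-category taxonomy maps naturally onto a staged pipeline. The sketch below wires hypothetical components through the pre-retrieval, retrieval, post-retrieval, and generation stages; all callables are assumed stand-ins (a query rewriter, a dense or sparse retriever, a reranker, and an LLM).

```python
# Illustrative four-stage RAG pipeline matching the survey's taxonomy.
# rewrite, retrieve, rerank, and generate are hypothetical components.

def rag_pipeline(query, rewrite, retrieve, rerank, generate, k=5):
    rewritten = rewrite(query)                    # pre-retrieval: query rewriting/expansion
    candidates = retrieve(rewritten)              # retrieval: fetch candidate documents
    top_docs = rerank(rewritten, candidates)[:k]  # post-retrieval: filter and reorder
    return generate(query, top_docs)              # generation: answer grounded in top-k docs
```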
arXiv Detail & Related papers (2024-04-17T01:27:42Z)
- CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources.
This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z)
- Evaluating Generative Ad Hoc Information Retrieval [58.800799175084286]
Generative retrieval systems often directly return a grounded generated text as a response to a query.
Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval.
arXiv Detail & Related papers (2023-11-08T14:05:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.