MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation
- URL: http://arxiv.org/abs/2504.17137v1
- Date: Wed, 23 Apr 2025 23:05:46 GMT
- Title: MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation
- Authors: Chanhee Park, Hyeonseok Moon, Chanjun Park, Heuiseok Lim
- Abstract summary: Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs). We present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks.
- Score: 8.950307082012763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings. The MIRAGE code and data are available at https://github.com/nlpai-lab/MIRAGE.
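To make the four adaptability dimensions concrete, the sketch below shows one plausible way to label a single QA instance by comparing the model's answer without retrieved context against its answer with it. The function and its decision rules are illustrative assumptions, not MIRAGE's official metric definitions; the released evaluation code at https://github.com/nlpai-lab/MIRAGE is authoritative.

```python
# Illustrative sketch only: one plausible way to bucket a QA instance into the
# four adaptability outcomes named in the abstract. The decision rules here are
# assumptions for exposition, not MIRAGE's official metric definitions.

def classify_adaptability(correct_without_context: bool,
                          correct_with_context: bool,
                          context_is_relevant: bool) -> str:
    """Return an illustrative adaptability label for one QA instance."""
    if correct_without_context and not correct_with_context:
        # Retrieval flipped a correct answer to an incorrect one.
        return ("context_misinterpretation" if context_is_relevant
                else "noise_vulnerability")
    if not correct_without_context and correct_with_context:
        # Retrieved context supplied the missing knowledge.
        return "context_acceptability"
    if not correct_with_context and context_is_relevant:
        # Relevant context was available but the answer is still wrong.
        return "context_insensitivity"
    return "unaffected"

# Example: the model only answers correctly once relevant context is retrieved.
print(classify_adaptability(False, True, True))  # -> "context_acceptability"
```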
Related papers
- SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction [20.6787276745193]
We introduce an automatic evaluation method that measures retrieval quality through the lens of information gain within the RAG framework. We quantify the utility of retrieval by the extent to which it reduces semantic perplexity post-retrieval (a rough formalization of this idea is sketched after this list).
arXiv Detail & Related papers (2025-03-03T12:37:34Z)
- Unanswerability Evaluation for Retrieval Augmented Generation [74.3022365715597]
UAEval4RAG is a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively.
We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries.
arXiv Detail & Related papers (2024-12-16T19:11:55Z)
- RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation [61.14660526363607]
We propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules.
RAGChecker has significantly better correlations with human judgments than other evaluation metrics.
The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems.
arXiv Detail & Related papers (2024-08-15T10:20:54Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- BERGEN: A Benchmarking Library for Retrieval-Augmented Generation [26.158785168036662]
Retrieval-Augmented Generation enhances Large Language Models with external knowledge.
Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline.
In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments.
arXiv Detail & Related papers (2024-07-01T09:09:27Z)
- FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
Retrieval-augmented generation (RAG) has attracted considerable research attention.
Existing RAG toolkits are often heavy and inflexible, failing to meet the customization needs of researchers.
Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z)
- Evaluation of Retrieval-Augmented Generation: A Survey [13.633909177683462]
We provide a comprehensive overview of the evaluation and benchmarks of Retrieval-Augmented Generation (RAG) systems.
Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness.
We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions to advance the field of RAG benchmarks.
arXiv Detail & Related papers (2024-05-13T02:33:25Z)
- RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems [51.171355532527365]
Retrieval-augmented generation (RAG) can significantly improve the performance of language models (LMs).
RAGGED is a framework for analyzing RAG configurations across various document-based question answering tasks.
arXiv Detail & Related papers (2024-03-14T02:26:31Z)
- CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources.
This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z)
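Returning to the SePer entry above: as a rough, non-authoritative reading of its information-gain view, retrieval utility can be framed as the drop in the gold answer's perplexity once retrieved passages are added to the prompt. The `score_answer` interface below is an assumed placeholder, not SePer's actual API.

```python
# Rough sketch of the perplexity-reduction idea attributed to SePer above.
# `score_answer(prompt, answer)` is an assumed helper that returns per-token
# log-probabilities of the answer under some language model; it is not part
# of SePer's released code.
import math
from typing import Callable, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of the gold answer from its per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / max(len(token_logprobs), 1))


def retrieval_utility(score_answer: Callable[[str, str], Sequence[float]],
                      question: str,
                      answer: str,
                      retrieved_context: str) -> float:
    """Positive values mean retrieval made the gold answer more likely."""
    ppl_without = perplexity(score_answer(question, answer))
    ppl_with = perplexity(score_answer(f"{retrieved_context}\n\n{question}", answer))
    return ppl_without - ppl_with
```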