Related papers: Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

URL: http://arxiv.org/abs/2406.14783v2
Date: Tue, 08 Oct 2024 15:10:42 GMT
Title: Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework
Authors: Zackary Rackauckas, Arthur Câmara, Jakub Zavrel
Abstract summary: We propose a comprehensive framework to evaluate Retrieval-Augmented Generation (RAG) Question-Answering systems. We use Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required.
Score: 0.5897092980823265
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of Retrieval-Augmented Generation (RAG) agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria.
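The abstract names two quantitative pieces: an Elo-based competition between RAG agents and MRR@5 for document relevance. The following is a minimal sketch of both, not RAGElo's actual implementation. The abstract does not give the K-factor, the starting rating, or how games are scheduled, so `k=32`, `initial=1000`, and the function names are assumptions chosen for illustration. Each game is assumed to be one pairwise verdict from an LLM judge on the two agents' answers to the same query.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of agent A against agent B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def run_elo(
    games: Iterable[Tuple[str, str, float]],
    k: float = 32.0,         # assumed K-factor, not taken from the paper
    initial: float = 1000.0,  # assumed starting rating, not taken from the paper
) -> Dict[str, float]:
    """Update ratings from pairwise outcomes.

    Each game is (agent_a, agent_b, score_a), where score_a is 1.0 if the
    judge preferred A's answer, 0.0 if it preferred B's, and 0.5 for a tie.
    """
    ratings: Dict[str, float] = {}
    for agent_a, agent_b, score_a in games:
        ra = ratings.setdefault(agent_a, initial)
        rb = ratings.setdefault(agent_b, initial)
        ea = expected_score(ra, rb)
        # Zero-sum update: whatever A gains, B loses.
        ratings[agent_a] = ra + k * (score_a - ea)
        ratings[agent_b] = rb - k * (score_a - ea)
    return ratings


def mrr_at_k(ranked_relevance: List[List[bool]], k: int = 5) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    ranked_relevance holds one list per query; each list flags, in rank
    order, whether the retrieved document was judged relevant.
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags[:k], start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)
```

Because sequential Elo updates depend on game order, a practical evaluation would likely shuffle the games and average over several runs. That is a general property of Elo, not a detail confirmed by the abstract.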

Related papers

RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z)
Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.92020689188887]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
arXiv Detail & Related papers (2025-02-26T04:50:43Z)
MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation [34.66546005629471]
Large Language Models (LLMs) are essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG). MAIN-RAG is a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents.
arXiv Detail & Related papers (2024-12-31T08:07:26Z)
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
Unanswerability Evaluation for Retrieval Augmented Generation [74.3022365715597]
UAEval4RAG is a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively. We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries.
arXiv Detail & Related papers (2024-12-16T19:11:55Z)
RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions [52.33835101586687]
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries. This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance. We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization. We also present ContextualBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
Evaluating ChatGPT on Nuclear Domain-Specific Data [0.0]
This paper examines the application of ChatGPT, a large language model (LLM), for question-and-answer (Q&A) tasks in the highly specialized field of nuclear data. The primary focus is on evaluating ChatGPT's performance on a curated test dataset. The findings underscore the improvement in performance when incorporating a RAG pipeline in an LLM.
arXiv Detail & Related papers (2024-08-26T08:17:42Z)
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation [61.14660526363607]
We propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules. RAGChecker has significantly better correlations with human judgments than other evaluation metrics. The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems.
arXiv Detail & Related papers (2024-08-15T10:20:54Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems [0.0]
Retrieval-Augmented Generation (RAG) has become a standard architectural pattern for incorporating domain-specific knowledge into user-facing chat applications. We introduce RAGBench: the first comprehensive, large-scale RAG benchmark dataset of 100k examples. We formalize the TRACe evaluation framework: a set of explainable and actionable RAG evaluation metrics applicable across all RAG domains.
arXiv Detail & Related papers (2024-06-25T20:23:15Z)
RAG-Fusion: a New Take on Retrieval-Augmented Generation [0.0]
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications.
arXiv Detail & Related papers (2024-01-31T22:06:07Z)
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources. This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z)
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems [46.522527144802076]
We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems. ARES finetunes lightweight LM judges to assess the quality of individual RAG components. We make our code and datasets publicly available on Github.
arXiv Detail & Related papers (2023-11-16T00:39:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.