Related papers: Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage

Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage

URL: http://arxiv.org/abs/2410.15531v1
Date: Sun, 20 Oct 2024 22:59:34 GMT
Title: Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage
Authors: Kaige Xie, Philippe Laban, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu,
Abstract summary: We introduce a novel framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question. We use this framework to evaluate three commercial generative answer engines: You.com, Perplexity AI, and Bing Chat. We find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions.
Score: 74.70255719194819
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating retrieval-augmented generation (RAG) systems remains challenging, particularly for open-ended questions that lack definitive answers and require coverage of multiple sub-topics. In this paper, we introduce a novel evaluation framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question. We propose decomposing questions into sub-questions and classifying them into three types -- core, background, and follow-up -- to reflect their roles and importance. Using this categorization, we introduce a fine-grained evaluation protocol that provides insights into the retrieval and generation characteristics of RAG systems, including three commercial generative answer engines: You.com, Perplexity AI, and Bing Chat. Interestingly, we find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions, revealing clear opportunities for improvement. Further, sub-question coverage metrics prove effective for ranking responses, achieving 82% accuracy compared to human preference annotations. Lastly, we also demonstrate that leveraging core sub-questions enhances both retrieval and answer generation in a RAG system, resulting in a 74% win rate over the baseline that lacks sub-questions.

Related papers

The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations. This approach was originally developed for the TREC Question Answering (QA) Track in 2003. We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z)
On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems [5.69361786082969]
Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs) We evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs. Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that.
arXiv Detail & Related papers (2025-02-20T17:34:34Z)
Unanswerability Evaluation for Retrieval Augmented Generation [74.3022365715597]
UAEval4RAG is a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively. We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries.
arXiv Detail & Related papers (2024-12-16T19:11:55Z)
RAG-based Question Answering over Heterogeneous Data and Text [23.075485587443485]
This article presents the QUASAR system for question answering over unstructured text, structured tables, and knowledge graphs. The system adopts a RAG-based architecture, with a pipeline of evidence retrieval followed by answer generation, with the latter powered by a moderate-sized language model. Experiments with three different benchmarks demonstrate the high answering quality of our approach, being on par with or better than large GPT models.
arXiv Detail & Related papers (2024-12-10T11:18:29Z)
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation [68.81271028921647]
We introduce CORAL, a benchmark designed to assess RAG systems in realistic multi-turn conversational settings. CORAL includes diverse information-seeking conversations automatically derived from Wikipedia. It supports three core tasks of conversational RAG: passage retrieval, response generation, and citation labeling.
arXiv Detail & Related papers (2024-10-30T15:06:32Z)
CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity [23.48167670445722]
Retrieval-Augmented Generation (RAG) aims to generate more accurate and reliable answers with the help of the retrieved context from external knowledge sources. evaluating these systems remains a crucial research area due to the following issues. We propose a Comprehensive Full-chain Evaluation (CoFE-RAG) framework to facilitate thorough evaluation across the entire RAG pipeline.
arXiv Detail & Related papers (2024-10-16T05:20:32Z)
Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation [53.285436927963865]
This paper presents the first systematic evaluation of RAG systems integrated with fair rankings. We focus specifically on measuring the fair exposure of each relevant item across the rankings utilized by RAG systems. Our findings indicate that RAG systems with fair rankings can maintain a high level of generation quality and, in many cases, even outperform traditional RAG systems.
arXiv Detail & Related papers (2024-09-17T23:10:04Z)
RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation [35.981443744108255]
We propose a novel RAG framework, namely RichRAG. It includes a sub-aspect explorer to identify potential sub-aspects of input questions, a retriever to build a candidate pool of diverse external documents related to these sub-aspects, and a generative list-wise ranker. Experimental results on two publicly available datasets prove that our framework effectively and efficiently provides comprehensive and satisfying responses to users.
arXiv Detail & Related papers (2024-06-18T12:52:51Z)
Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy [66.95501113584541]
Utility and topical relevance are critical measures in information retrieval. We propose an Iterative utiliTy judgmEnt fraMework to promote each step of the cycle of Retrieval-Augmented Generation.
arXiv Detail & Related papers (2024-06-17T07:52:42Z)
Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check [25.63538452425097]
We propose a conversation-level RAG approach, which incorporates fine-grained retrieval augmentation and self-check for conversational question answering. In particular, our approach consists of three components, namely conversational question refiner, fine-grained retriever and self-check based response generator.
arXiv Detail & Related papers (2024-03-27T04:20:18Z)
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources. This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z)
The Power of Noise: Redefining Retrieval for RAG Systems [19.387105120040157]
Retrieval-Augmented Generation (RAG) has emerged as a method to extend beyond the pre-trained knowledge of Large Language Models. We focus on the type of passages IR systems within a RAG solution should retrieve.
arXiv Detail & Related papers (2024-01-26T14:14:59Z)
Reranking Overgenerated Responses for End-to-End Task-Oriented Dialogue Systems [71.33737787564966]
End-to-end (E2E) task-oriented dialogue (ToD) systems are prone to fall into the so-called 'likelihood trap' We propose a reranking method which aims to select high-quality items from the lists of responses initially overgenerated by the system. Our methods improve a state-of-the-art E2E ToD system by 2.4 BLEU, 3.2 ROUGE, and 2.8 METEOR scores, achieving new peak results.
arXiv Detail & Related papers (2022-11-07T15:59:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.