CRAG -- Comprehensive RAG Benchmark
- URL: http://arxiv.org/abs/2406.04744v1
- Date: Fri, 7 Jun 2024 08:43:07 GMT
- Title: CRAG -- Comprehensive RAG Benchmark
- Authors: Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
- Abstract summary: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to mitigate Large Language Models' (LLMs') lack of knowledge.
Existing RAG datasets do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks.
We introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search.
- Score: 58.15980697921195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to mitigate Large Language Models' (LLMs') lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamism ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy on questions concerning facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.
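For concreteness, here is a minimal Python sketch of the three-way grading the abstract implies: rewarding correct answers, treating abstentions as neutral, and penalizing hallucinations. The label names, weights, and example split below are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of a CRAG-style truthfulness score. Grade labels and
# weights are assumptions inferred from the abstract, not the paper's
# official scoring code.
SCORES = {"perfect": 1.0, "missing": 0.0, "hallucination": -1.0}

def truthfulness_score(grades: list[str]) -> float:
    """Average per-question score: +1 correct, 0 abstained, -1 hallucinated."""
    if not grades:
        return 0.0
    return sum(SCORES[g] for g in grades) / len(grades)

# Example loosely mirroring the abstract's "63% answered without any
# hallucination"; the 20/17 split of the remainder is an assumption.
grades = ["perfect"] * 63 + ["missing"] * 20 + ["hallucination"] * 17
print(f"truthfulness: {truthfulness_score(grades):.2f}")  # -> 0.46
```

Under such a scheme a system that abstains when unsure can outscore one that guesses, which is why hallucination-aware grading matters for RAG evaluation.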
Related papers
- ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities [51.587657076291]
ChatQA 2 is a Llama3-based model designed to bridge the gap between open-access LLMs and leading proprietary models.
We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens.
We find that the state-of-the-art long-context retriever can alleviate the top-k context fragmentation issue in RAG.
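As a toy illustration of the top-k context fragmentation issue mentioned above, the sketch below shows how naive fixed-size chunking can split a fact across chunk boundaries, so top-k retrieval returns fragments rather than a coherent passage. The chunk size and sample text are assumptions for illustration only.

```python
# Toy demonstration of context fragmentation from fixed-size chunking.
# Chunk size and sample text are illustrative assumptions.
def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into fixed-size character chunks (no overlap)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = ("The context window of Llama3-70B-base was extended "
       "from 8K to 128K tokens using a long-context training recipe.")
for i, piece in enumerate(chunk(doc)):
    print(f"chunk {i}: {piece!r}")
# The key fact ("8K to 128K") can straddle a chunk boundary, so a
# top-k retriever may surface only part of it; a long-context
# retriever that returns whole passages avoids this fragmentation.
```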
arXiv Detail & Related papers (2024-07-19T17:35:47Z)
- RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering [61.19126689470398]
Long-form RobustQA (LFRQA) is a new dataset covering 26K queries and large corpora across seven different domains.
We show via experiments that RAG-QA Arena and human judgments on answer quality are highly correlated.
Only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
arXiv Detail & Related papers (2024-07-19T03:02:51Z)
- UniOQA: A Unified Framework for Knowledge Graph Question Answering with Large Language Models [4.627548680442906]
OwnThink is the most extensive Chinese open-domain knowledge graph introduced in recent times.
We introduce UniOQA, a unified framework that integrates two parallel approaches to question answering.
UniOQA notably advances SpCQL Logical Accuracy to 21.2% and Execution Accuracy to 54.9%, achieving the new state-of-the-art results on this benchmark.
arXiv Detail & Related papers (2024-06-04T08:36:39Z)
- DuetRAG: Collaborative Retrieval-Augmented Generation [57.440772556318926]
DuetRAG, a Collaborative Retrieval-Augmented Generation framework, is proposed.
Its bootstrapping philosophy is to simultaneously integrate domain fine-tuning and RAG models.
arXiv Detail & Related papers (2024-05-12T09:48:28Z)
- Retrieval-Augmented Generation for AI-Generated Content: A Survey [38.50754568320154]
Retrieval-Augmented Generation (RAG) has emerged as a paradigm to address such challenges.
RAG introduces the information retrieval process, which enhances the generation process by retrieving relevant objects from available data stores.
In this paper, we comprehensively review existing efforts that integrate the RAG technique into AIGC scenarios.
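To make the retrieve-then-generate pattern concrete, here is a self-contained toy sketch; the embedding, retrieval, and prompt-building functions are placeholders (assumptions), not any particular library's API.

```python
# Toy retrieve-then-generate pipeline. All components are illustrative
# placeholders, not a specific framework's API.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: normalized character-code histogram."""
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda d: float(embed(d) @ q), reverse=True)[:k]

def build_prompt(query: str, contexts: list[str]) -> str:
    """Prepend retrieved context; a real system would send this to an LLM."""
    return "\n".join(contexts) + f"\n\nQuestion: {query}\nAnswer:"

corpus = [
    "CRAG contains 4,409 question-answer pairs.",
    "RAG retrieves relevant objects from data stores before generating.",
]
question = "How many question-answer pairs does CRAG contain?"
print(build_prompt(question, retrieve(question, corpus)))
```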
arXiv Detail & Related papers (2024-02-29T18:59:01Z)
- CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources.
This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z)
- ChatQA: Surpassing GPT-4 on Conversational QA and RAG [43.34692996785167]
We introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA).
For effective retrieval, we introduce a dense retriever optimized for conversational QA, which yields results comparable to alternative state-of-the-art query rewriting models.
We present the ChatRAG Bench, which encompasses ten datasets covering comprehensive evaluations on RAG, table-related QA, arithmetic calculations, and scenarios involving unanswerable questions.
arXiv Detail & Related papers (2024-01-18T18:59:11Z)
- The Chronicles of RAG: The Retriever, the Chunk and the Generator [0.0]
This paper presents good practices to implement, optimize, and evaluate RAG for the Brazilian Portuguese language.
We explore a diverse set of methods to answer questions about the first Harry Potter book.
arXiv Detail & Related papers (2024-01-15T18:25:18Z)
- Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.