Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
- URL: http://arxiv.org/abs/2407.16833v2
- Date: Thu, 17 Oct 2024 17:51:19 GMT
- Title: Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
- Authors: Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky,
- Abstract summary: Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts.
We compare RAG and long-context (LC) LLMs, aiming to leverage the strengths of both.
We propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection.
- Score: 26.02167477129771
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts. However, recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to understand long contexts directly. We conduct a comprehensive comparison between RAG and long-context (LC) LLMs, aiming to leverage the strengths of both. We benchmark RAG and LC across various public datasets using three latest LLMs. Results reveal that when resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage. Based on this observation, we propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection. Self-Route significantly reduces the computation cost while maintaining a comparable performance to LC. Our findings provide a guideline for long-context applications of LLMs using RAG and LC.
Related papers
- LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing [70.35888047551643]
We present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs.
LaRA encompasses 2,326 test cases across four practical QA task categories and three types of naturally occurring long texts.
We find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks.
arXiv Detail & Related papers (2025-02-14T08:04:22Z) - Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models [27.217391392240113]
Long-context language models (LCLMs) can process entire knowledge bases and perform retrieval and reasoning directly.
Existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts.
We introduce ICR2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers.
We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head.
arXiv Detail & Related papers (2025-01-14T16:38:33Z) - Long Context vs. RAG for LLMs: An Evaluation and Revisits [41.27137478456755]
This paper revisits recent studies on this topic, highlighting their key insights and discrepancies.
We show that LC generally outperforms RAG in question-answering benchmarks, especially for Wikipedia-based questions.
We also provide an in-depth discussion on this topic, highlighting the overlooked importance of context relevance in existing studies.
arXiv Detail & Related papers (2024-12-27T14:34:37Z) - Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG)
We compute a factuality score that can be thresholded to yield a binary decision.
Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
arXiv Detail & Related papers (2024-11-01T20:44:59Z) - LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering [27.114593394058144]
LongRAG is a general, dual-perspective, and robust LLM-based RAG system paradigm for LCQA.
LongRAG significantly outperforms long-context LLMs (up by 6.94%), advanced RAG (up by 6.16%), and Vanilla RAG (up by 17.25%)
arXiv Detail & Related papers (2024-10-23T17:24:58Z) - Control Large Language Models via Divide and Conquer [94.48784966256463]
This paper investigates controllable generation for large language models (LLMs) with prompt-based control, focusing on Lexically Constrained Generation (LCG)
We evaluate the performance of LLMs on satisfying lexical constraints with prompt-based control, as well as their efficacy in downstream applications.
arXiv Detail & Related papers (2024-10-06T21:20:06Z) - In Defense of RAG in the Era of Long-Context Language Models [17.397639724806364]
Retrieval-augmented generation (RAG) has been a reliable solution for context-based answer generation in the past.
Recent studies show that long-context LLMs significantly outperform RAG in long-context applications.
We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG for long-context question-answer applications.
arXiv Detail & Related papers (2024-09-03T07:17:41Z) - Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting [68.90949377014742]
Speculative RAG is a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM.
Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts.
It notably enhances accuracy by up to 12.97% while reducing latency by 51% compared to conventional RAG systems on PubHealth.
arXiv Detail & Related papers (2024-07-11T06:50:19Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG.
InFO-RAG is low-cost and general across various tasks.
It improves the performance of LLaMA2 by an average of 9.39% relative points.
arXiv Detail & Related papers (2024-02-28T08:24:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.