Related papers: Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

URL: http://arxiv.org/abs/2501.08248v1
Date: Tue, 14 Jan 2025 16:38:33 GMT
Title: Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models
Authors: Yifu Qiu, Varun Embar, Yizhe Zhang, Navdeep Jaitly, Shay B. Cohen, Benjamin Han,
Abstract summary: Long-context language models (LCLMs) can process entire knowledge bases and perform retrieval and reasoning directly.<n>Existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts.<n>We introduce ICR2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers.<n>We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head.
Score: 27.217391392240113
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.

Related papers

A Comprehensive Survey on Long Context Language Modeling [118.5540791080351]
Long Context Language Models (LCLMs) process and analyze extensive inputs in an effective and efficient way. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively.
arXiv Detail & Related papers (2025-03-20T17:06:28Z)
U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack [9.760456105567078]
This paper introduces U-NIAH, a unified framework that systematically compares Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) Our framework incorporates multi-needle, long-needle, and needle-in-needle configurations, along with different retrieval settings. Our findings show that RAG significantly enhances smaller LLMs by mitigating the "lost-in-the-middle" effect and improving robustness.
arXiv Detail & Related papers (2025-03-01T05:05:24Z)
LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs -- No Silver Bullet for LC or RAG Routing [70.35888047551643]
We present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs. LaRA encompasses 2326 test cases across four practical QA task categories and three types of naturally occurring long texts. We find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks.
arXiv Detail & Related papers (2025-02-14T08:04:22Z)
Long Context vs. RAG for LLMs: An Evaluation and Revisits [41.27137478456755]
This paper revisits recent studies on this topic, highlighting their key insights and discrepancies.<n>We show that LC generally outperforms RAG in question-answering benchmarks, especially for Wikipedia-based questions.<n>We also provide an in-depth discussion on this topic, highlighting the overlooked importance of context relevance in existing studies.
arXiv Detail & Related papers (2024-12-27T14:34:37Z)
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach [26.02167477129771]
Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts. We compare RAG and long-context (LC) LLMs, aiming to leverage the strengths of both. We propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection.
arXiv Detail & Related papers (2024-07-23T20:51:52Z)
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities [53.97515452727115]
ChatQA 2 is a Llama 3.0-based model with a 128K context window. We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens. Our results demonstrate that the Llama3-ChatQA-2-70B model outperforms most existing state-of-the-art models.
arXiv Detail & Related papers (2024-07-19T17:35:47Z)
Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents [25.825941077332182]
This paper is the first to investigate Large Language Models (LLMs) as in-context decision-makers under the problem of Dueling Bandits (DB)<n>We compare GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, Llama 3.1, and o1-Preview against nine well-established DB algorithms.<n>We show that our top-performing LLM, GPT-4 Turbo, has the zero-shot relative decision-making ability to achieve surprisingly low weak regret.
arXiv Detail & Related papers (2024-07-02T02:18:14Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction [11.535892987373947]
Relation extraction (RE) aims to identify relations between entities mentioned in texts. Large language models (LLMs) have demonstrated impressive in-context learning abilities in various tasks. LLMs suffer from poor performances compared to most supervised fine-tuned RE methods.
arXiv Detail & Related papers (2024-04-27T07:12:52Z)
SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs [85.54906813106683]
We propose a simple yet effective framework to enhance open-domain question answering (ODQA) with large language models (LLMs) SuRe helps LLMs predict more accurate answers for a given question, which are well-supported by the summarized retrieval (SuRe) Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.6% in exact match (EM) and 4.0% in F1 score over standard prompting approaches.
arXiv Detail & Related papers (2024-04-17T01:15:54Z)
LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLM) are leveraging human feedback to improve their generation quality. We propose LLMRefine, an inference time optimization method to refine LLM's output. We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization. LLMRefine consistently outperforms all baseline approaches, achieving improvements up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
Re-Reading Improves Reasoning in Large Language Models [87.46256176508376]
We introduce a simple, yet general and effective prompting method, Re2, to enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs) Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), Re2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process. We evaluate Re2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality.
arXiv Detail & Related papers (2023-09-12T14:36:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.