Understanding Synthetic Context Extension via Retrieval Heads
- URL: http://arxiv.org/abs/2410.22316v2
- Date: Mon, 30 Dec 2024 01:48:26 GMT
- Title: Understanding Synthetic Context Extension via Retrieval Heads
- Authors: Xinyu Zhao, Fangcong Yin, Greg Durrett
- Abstract summary: We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
- Score: 51.8869530817334
- Abstract: Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically generated long-context data in a post-training stage. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted and even predicted in terms of a special set of attention heads that are responsible for retrieval over long context, retrieval heads (Wu et al., 2024). The retrieval heads learned on synthetic data have high overlap with retrieval heads learned on real data, and there is a strong correlation between the recall of heads learned and the downstream performance of a model. Furthermore, with attention knockout and activation patching, we mechanistically show that retrieval heads are necessary and explain model performance, although they are not totally sufficient. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
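The abstract's two quantitative ideas, the recall of retrieval heads learned on synthetic data relative to those learned on real data, and its correlation with downstream performance, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `head_recall` and `pearson`, the representation of a head as a `(layer, head)` pair, and the use of Pearson correlation are all assumptions, and the numbers in the demo are placeholders rather than results from the paper.

```python
from math import sqrt

# Hypothetical representation: a head is identified by (layer index, head index).
Head = tuple[int, int]


def head_recall(synthetic_heads: set[Head], real_heads: set[Head]) -> float:
    """Fraction of real-data retrieval heads that also appear in the synthetic-data set."""
    if not real_heads:
        raise ValueError("real_heads must be non-empty")
    return len(synthetic_heads & real_heads) / len(real_heads)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values for illustration only; not taken from the paper.
    real = {(10, 3), (12, 7), (15, 1), (20, 4)}
    synthetic_variants = [
        {(10, 3)},                    # e.g. a purely symbolic dataset
        {(10, 3), (12, 7)},           # e.g. templated relations
        {(10, 3), (12, 7), (15, 1)},  # e.g. LLM-written documents
    ]
    recalls = [head_recall(s, real) for s in synthetic_variants]
    task_scores = [0.3, 0.5, 0.8]     # made-up downstream accuracies
    print(recalls, pearson(recalls, task_scores))
```

Under these assumptions, each synthetic dataset variant yields one recall value, and the correlation is computed across variants against the corresponding downstream task scores.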
Related papers
- FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data [13.108807408880645]
We propose a novel approach for synthetic data generation, CG2C, that leverages multi-hop reasoning on context graphs extracted from documents.
Our fact checker model, FactCG, demonstrates improved performance with more connected reasoning, using the same backbone models.
arXiv Detail & Related papers (2025-01-28T18:45:07Z)
- A Reality Check on Context Utilisation for Retrieval-Augmented Generation [44.54803681476863]
We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance.
The dataset is based on the task of automated claim verification, for which automated retrieval of real-world evidence is crucial.
We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results.
arXiv Detail & Related papers (2024-12-22T14:16:38Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- ACER: Automatic Language Model Context Extension via Retrieval [36.40066695682234]
Current open-weight generalist long-context models are still lacking in practical long-context processing tasks.
We build an automatic data synthesis pipeline that mimics this process using short-context LMs.
The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities.
arXiv Detail & Related papers (2024-10-11T17:57:06Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Learning to Reduce: Optimal Representations of Structured Data in Prompting Large Language Models [42.16047343029512]
Large Language Models (LLMs) have been widely used as general-purpose AI agents.
We propose a framework, Learning to Reduce, that fine-tunes a language model to generate a reduced version of an input context.
We show that our model achieves comparable accuracies in selecting the relevant evidence from an input context.
arXiv Detail & Related papers (2024-02-22T00:41:23Z)
- JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching [18.94748873243611]
JobSkape is a framework to generate synthetic data for skill-to-taxonomy matching.
Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings.
We present a multi-step pipeline for skill extraction and matching tasks using large language models.
arXiv Detail & Related papers (2024-02-05T17:57:26Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z)
- Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows retrieval models (RMs) to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z)