Retrieval Head Mechanistically Explains Long-Context Factuality
- URL: http://arxiv.org/abs/2404.15574v1
- Date: Wed, 24 Apr 2024 00:24:03 GMT
- Title: Retrieval Head Mechanistically Explains Long-Context Factuality
- Authors: Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu
- Abstract summary: We show that a special type of attention head, which we dub retrieval heads, is largely responsible for retrieving information from the long context.
We show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context.
We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
- Score: 56.78951509492645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent progress in long-context language models, it remains elusive how transformer-based models retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention head is largely responsible for retrieving information; we dub these retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads. (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads. (3) intrinsic: retrieval heads already exist in models pretrained with short context; when the context length is extended by continual pretraining, it is still the same set of heads that performs information retrieval. (4) dynamically activated: taking Llama-2 7B as an example, 12 retrieval heads always attend to the required information no matter how the context is changed, while the remaining retrieval heads are activated only in certain contexts. (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context. Conversely, tasks where the model directly generates the answer using its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
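A minimal sketch, written from the abstract alone, of the copy-paste criterion used to score retrieval heads: a head counts as retrieving when its most-attended context token lies inside the needle and equals the token the model is currently generating. The function name, array layout, and toy data below are assumptions for illustration, not the authors' implementation.

```python
"""Toy retrieval-head scoring sketch (assumed interface, not the paper's code)."""
import numpy as np

def retrieval_scores(attn, context_ids, generated_ids, needle_span):
    """
    attn:          [steps, heads, context_len] attention over the context at each decoding step.
    context_ids:   [context_len] token ids of the input context.
    generated_ids: [steps] token id produced at each decoding step.
    needle_span:   (start, end) indices of the needle inside the context.
    Returns one score per head: the fraction of decoding steps at which the head's
    most-attended context token lies inside the needle and equals the generated token.
    """
    steps, heads, _ = attn.shape
    start, end = needle_span
    hits = np.zeros(heads)
    for t in range(steps):
        top = attn[t].argmax(axis=-1)                   # [heads] most-attended position
        in_needle = (top >= start) & (top < end)        # head is looking at the needle
        copied = context_ids[top] == generated_ids[t]   # and the model copies that token
        hits += in_needle & copied
    return hits / (end - start)

# Random stand-in data, just to show the bookkeeping end to end.
rng = np.random.default_rng(0)
steps, heads, ctx = 8, 32, 128
attn = rng.dirichlet(np.ones(ctx), size=(steps, heads))
context_ids = rng.integers(0, 1000, size=ctx)
generated_ids = rng.integers(0, 1000, size=steps)
scores = retrieval_scores(attn, context_ids, generated_ids, (60, 68))
print("heads with score > 0.1:", np.where(scores > 0.1)[0])
```

Under a scoring like this, the causal ablation in property (5) amounts to zeroing the outputs of the highest-scoring heads and re-running the retrieval test.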
Related papers
- Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information [16.28488243884373]
Temporal Heads are specific attention heads, identified through circuit analysis, that are primarily responsible for processing temporal knowledge.
We confirm that these heads are present across multiple models, though their specific locations may vary.
We expand the potential of our findings by demonstrating how temporal knowledge can be edited by adjusting the values of these heads.
arXiv Detail & Related papers (2025-02-20T04:52:05Z) - Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification [20.49185921960757]
We show that attention heads swing between attending to local and long-context information depending on the query.
We demonstrate that it's possible to predict which heads are crucial for long-context processing using only local keys.
arXiv Detail & Related papers (2025-02-11T00:04:32Z) - To Retrieve or Not to Retrieve? Uncertainty Detection for Dynamic Retrieval Augmented Generation [3.724713116252253]
Our findings suggest that uncertainty detection metrics, such as Degree Matrix Jaccard and Eccentricity, can reduce the number of retrieval calls by almost half, with only a slight reduction in question-answering accuracy (a toy sketch of this pattern appears after the related-papers list).
arXiv Detail & Related papers (2025-01-16T04:56:33Z) - Analyzing Human Questioning Behavior and Causal Curiosity through Natural Queries [91.70689724416698]
We present NatQuest, a collection of 13,500 naturally occurring questions from three diverse sources.
Our analysis reveals a significant presence of causal questions (up to 42%) within the dataset.
arXiv Detail & Related papers (2024-05-30T17:55:28Z) - Retrieval Helps or Hurts? A Deeper Dive into the Efficacy of Retrieval Augmentation to Language Models [7.537599020279862]
We explore the effects of combinations of entities and relations on large language models (LMs).
We observe that larger LMs excel in recalling popular facts, but encounter difficulty with infrequent entity-relation pairs compared to retrievers.
We demonstrate the efficacy of our finer-grained metric and insights through an adaptive retrieval system.
arXiv Detail & Related papers (2024-02-21T03:05:50Z) - Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information [50.41829484199252]
Transformer-based models with the multi-head self-attention mechanism are widely used in natural language processing.
We investigate the attention head selection and manipulation strategy for feature injection from a network pruning perspective.
arXiv Detail & Related papers (2023-12-15T05:27:24Z) - Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective [91.14291142262262]
This work presents a straightforward and fundamental explanation from the data perspective.
Our preliminary investigation reveals a strong correlation between the degeneration issue and the presence of repetitions in training data.
Our experiments reveal that penalizing the repetitions in training data remains critical even when considering larger model sizes and instruction tuning.
arXiv Detail & Related papers (2023-10-16T09:35:42Z) - An Overview Of Temporal Commonsense Reasoning and Acquisition [20.108317515225504]
Temporal commonsense reasoning refers to the ability to understand the typical temporal context of phrases, actions, and events.
Recent research on the performance of large language models suggests that they often take shortcuts in their reasoning and fall prey to simple linguistic traps.
arXiv Detail & Related papers (2023-07-28T01:30:15Z) - RECKONING: Reasoning through Dynamic Knowledge Encoding [51.076603338764706]
We show that language models can answer questions by reasoning over knowledge provided as part of the context.
However, in these settings the model can fail to distinguish the knowledge that is necessary to answer the question from the rest of the provided context.
We propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model's parameters.
arXiv Detail & Related papers (2023-05-10T17:54:51Z) - Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering [80.60605604261416]
We propose a novel neuro-symbolic framework for zero-shot question answering across commonsense tasks.
We vary the set of language models, training regimes, knowledge sources, and data generation strategies, and measure their impact across tasks.
We show that, while an individual knowledge graph is better suited for specific tasks, a global knowledge graph brings consistent gains across different tasks.
arXiv Detail & Related papers (2020-11-07T22:52:21Z)
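For the dynamic-RAG entry above, a toy sketch of the retrieve-only-when-uncertain pattern: sample several answers, measure how much they agree, and issue a retrieval call only when agreement is low. The mean pairwise Jaccard similarity below is a simplified stand-in for the Degree Matrix Jaccard metric named in that abstract, and `sample_answers` / `retrieve_then_answer` are hypothetical placeholders for a real model and retriever.

```python
"""Toy uncertainty-gated retrieval sketch (simplified, assumed interfaces)."""
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def mean_degree(answers: list[str]) -> float:
    """Average pairwise Jaccard similarity: high means the samples agree."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def answer_with_optional_retrieval(question, sample_answers, retrieve_then_answer,
                                   n_samples=5, threshold=0.5):
    samples = [sample_answers(question) for _ in range(n_samples)]
    if mean_degree(samples) >= threshold:
        return samples[0]                   # samples are consistent: skip retrieval
    return retrieve_then_answer(question)   # uncertain: fall back to RAG

# Dummy stand-ins so the sketch runs end to end.
fake_lm = lambda q: "paris is the capital of france"
fake_rag = lambda q: "retrieved: paris"
print(answer_with_optional_retrieval("capital of France?", fake_lm, fake_rag))
```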
This list is automatically generated from the titles and abstracts of the papers on this site.