Which Attention Heads Matter for In-Context Learning?
- URL: http://arxiv.org/abs/2502.14010v1
- Date: Wed, 19 Feb 2025 12:25:02 GMT
- Title: Which Attention Heads Matter for In-Context Learning?
- Authors: Kayo Yin, Jacob Steinhardt
- Abstract summary: Large language models (LLMs) exhibit impressive in-context learning (ICL) capability. Two different mechanisms have been proposed to explain ICL: induction heads that find and copy relevant tokens, and function vector (FV) heads whose activations compute a latent encoding of the ICL task. We study and compare induction heads and FV heads in 12 language models.
- Score: 41.048579134842285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) exhibit impressive in-context learning (ICL) capability, enabling them to perform new tasks using only a few demonstrations in the prompt. Two different mechanisms have been proposed to explain ICL: induction heads that find and copy relevant tokens, and function vector (FV) heads whose activations compute a latent encoding of the ICL task. To better understand which of the two distinct mechanisms drives ICL, we study and compare induction heads and FV heads in 12 language models. Through detailed ablations, we discover that few-shot ICL performance depends primarily on FV heads, especially in larger models. In addition, we uncover that FV and induction heads are connected: many FV heads start as induction heads during training before transitioning to the FV mechanism. This leads us to speculate that induction facilitates learning the more complex FV mechanism that ultimately drives ICL.
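To make the ablation methodology concrete, here is a minimal, hypothetical sketch of mean-ablating selected attention heads while leaving the rest intact: the kind of intervention used to test whether few-shot ICL depends on FV heads or on induction heads. Tensor shapes and names are illustrative assumptions, not the authors' code.

```python
import torch

def ablate_heads(head_outputs: torch.Tensor, heads_to_ablate, head_means=None):
    """Zero- or mean-ablate selected attention heads.

    head_outputs: [batch, n_heads, seq, d_head], e.g. captured with a
    forward hook just before the attention output projection.
    head_means:   optional [n_heads, d_head] average activations; if given,
    ablation replaces each chosen head's output with its mean (mean
    ablation) rather than with zeros.
    """
    out = head_outputs.clone()
    for h in heads_to_ablate:
        if head_means is None:
            out[:, h] = 0.0                  # zero ablation
        else:
            out[:, h] = head_means[h]        # mean ablation
    return out

# Toy usage: batch of 2, 4 heads, 8 tokens, 16-dim heads.
x = torch.randn(2, 4, 8, 16)
means = x.mean(dim=(0, 2))                   # [n_heads, d_head]
ablated = ablate_heads(x, heads_to_ablate=[1, 3], head_means=means)
# Head 1 now carries no position-dependent signal:
print(ablated[:, 1].std(dim=(0, 1)).max())   # ~0
```

Measuring few-shot accuracy with and without such an ablation, applied separately to induction heads and to FV heads, is the kind of comparison the abstract describes.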
Related papers
- Explicit Multi-head Attention for Inter-head Interaction in Large Language Models [70.96854312026319]
Multi-head Explicit Attention (MEA) is a simple yet effective attention variant that explicitly models cross-head interaction. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence. It also enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss.
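The abstract does not specify MEA's formulation. As a purely illustrative stand-in, one established way to model cross-head interaction is a learned head-mixing matrix applied along the head axis (in the style of talking-heads attention); the sketch below mixes head outputs and should not be read as MEA itself.

```python
import torch
import torch.nn as nn

class HeadMixing(nn.Module):
    """Illustrative cross-head interaction: a learned n_heads x n_heads
    matrix lets each head's output read from every other head's."""

    def __init__(self, n_heads: int):
        super().__init__()
        # Initialize near identity so training starts close to vanilla MHA.
        self.mix = nn.Parameter(torch.eye(n_heads) + 0.01 * torch.randn(n_heads, n_heads))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: [batch, n_heads, seq, d_head]; mix along the head axis only.
        return torch.einsum("ij,bjsd->bisd", self.mix, head_outputs)

mixer = HeadMixing(n_heads=4)
y = mixer(torch.randn(2, 4, 8, 16))
print(y.shape)  # torch.Size([2, 4, 8, 16])
```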
arXiv Detail & Related papers (2026-01-27T13:45:03Z)
- Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules [76.21320451720764]
We introduce CogVision, a dataset that decomposes complex multimodal questions into step-by-step subquestions. Using a probing-based methodology, we identify attention heads that specialize in these functions and characterize them as functional heads. Our analysis reveals that these functional heads are universally sparse, vary in number and distribution across functions, and mediate interactions and hierarchical organization.
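A hedged sketch of the probing-based idea: fit a linear probe on a head's activations to predict whether a target function is exercised, and treat heads whose probes score well as candidates for "functional heads". The synthetic data and names below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: per-example activations for one head [n_examples, d_head]
# and binary labels marking whether the example exercises a target function.
d_head, n = 32, 500
labels = rng.integers(0, 2, size=n)
signal = np.outer(labels, rng.normal(size=d_head))       # head encodes the label
acts = signal + rng.normal(scale=2.0, size=(n, d_head))  # plus noise

# Cross-validated probe accuracy serves as the head's "functional" score.
probe = LogisticRegression(max_iter=1000)
score = cross_val_score(probe, acts, labels, cv=5).mean()
print(f"probe accuracy: {score:.2f}")  # well above 0.5 => head carries the signal
```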
arXiv Detail & Related papers (2025-12-11T05:42:53Z)
- Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning [54.12174882424842]
Large language models (LLMs) have achieved state-of-the-art performance in a variety of tasks, but remain largely opaque in terms of their internal mechanisms. We propose a novel interpretability framework to systematically analyze the roles and behaviors of attention heads.
arXiv Detail & Related papers (2025-12-03T10:24:34Z)
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
- KV Shifting Attention Enhances Language Modeling [10.265219156828907]
Current large language models are mainly based on decoder-only Transformers, which have strong in-context learning capabilities. We propose KV shifting attention to implement the model's induction ability more efficiently. Our experimental results demonstrate that KV shifting attention is beneficial for learning induction heads and for language modeling.
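The abstract does not give the exact parameterization, but the core idea can be sketched as letting each position's key and value mix in the previous position's, so that even a single attention layer can express the match-previous-token/copy-next-token pattern that usually requires two layers of induction circuitry. A hedged, illustrative sketch with assumed shapes and coefficients:

```python
import torch
import torch.nn as nn

class KVShift(nn.Module):
    """Illustrative KV shifting: each position's key/value becomes a learned
    mix of its own and the previous position's. A rough sketch of the idea
    in the abstract, not the paper's exact parameterization."""

    def __init__(self):
        super().__init__()
        self.alpha_k = nn.Parameter(torch.tensor(0.5))  # weight on shifted keys
        self.alpha_v = nn.Parameter(torch.tensor(0.5))  # weight on shifted values

    @staticmethod
    def shift_right(x: torch.Tensor) -> torch.Tensor:
        # [batch, seq, d] -> same shape, position t holds x[t-1]; position 0 is zeros.
        return torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)

    def forward(self, k: torch.Tensor, v: torch.Tensor):
        k = (1 - self.alpha_k) * k + self.alpha_k * self.shift_right(k)
        v = (1 - self.alpha_v) * v + self.alpha_v * self.shift_right(v)
        return k, v

kv = KVShift()
k, v = kv(torch.randn(1, 6, 16), torch.randn(1, 6, 16))
print(k.shape, v.shape)  # torch.Size([1, 6, 16]) torch.Size([1, 6, 16])
```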
arXiv Detail & Related papers (2024-11-29T09:42:38Z)
- Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning [12.911829891263263]
We show that even a minimal ablation of induction heads leads to ICL performance decreases of up to 32% for abstract pattern recognition tasks.
For NLP tasks, this ablation substantially decreases the model's ability to benefit from examples, bringing few-shot ICL performance close to that of zero-shot prompts.
arXiv Detail & Related papers (2024-07-09T16:29:21Z)
- Identifying Semantic Induction Heads to Understand In-Context Learning [103.00463655766066]
We investigate whether attention heads encode two types of relationships between tokens present in natural languages.
We find that certain attention heads exhibit a pattern where, when attending to head tokens, they recall tail tokens and increase the output logits of those tail tokens.
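The "increase the output logits of those tail tokens" behavior is the sort of effect that direct logit attribution makes visible: project a head's additive contribution to the residual stream through the unembedding and read off per-token logit boosts. A self-contained toy version, where the unembedding and head output are random stand-ins:

```python
import torch

torch.manual_seed(0)
d_model, vocab = 64, 100
W_U = torch.randn(d_model, vocab)  # toy unembedding matrix

# Toy per-head residual-stream contribution at the final position.
head_out = torch.randn(d_model)

# Direct logit attribution: the head's additive contribution to each
# token's logit is its output projected through the unembedding.
logit_contrib = head_out @ W_U     # [vocab]
tail_token = 42
print(f"head's contribution to token {tail_token}: {logit_contrib[tail_token]:.3f}")
print("top boosted tokens:", logit_contrib.topk(3).indices.tolist())
```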
arXiv Detail & Related papers (2024-02-20T14:43:39Z)
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study ICL through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
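A hedged sketch of what a regular ICLL task can look like: sample strings from a small random probabilistic automaton and ask a model to continue them in context. The generator below is illustrative, not the paper's exact construction.

```python
import random

rng = random.Random(0)
ALPHABET = "ab"

def random_pfa(n_states=4):
    """Random probabilistic finite automaton: per-state emission
    probabilities plus deterministic transitions on each symbol."""
    emit, trans = {}, {}
    for s in range(n_states):
        p = rng.random()
        emit[s] = {"a": p, "b": 1 - p}
        trans[s] = {c: rng.randrange(n_states) for c in ALPHABET}
    return emit, trans

def sample_string(emit, trans, length=10):
    state, out = 0, []
    for _ in range(length):
        c = "a" if rng.random() < emit[state]["a"] else "b"
        out.append(c)
        state = trans[state][c]
    return "".join(out)

emit, trans = random_pfa()
# Few-shot prompt: several strings from one hidden automaton; the model
# must infer its structure in context to continue a new string well.
print(" ".join(sample_string(emit, trans) for _ in range(5)))
```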
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs).
Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)
- In-context Learning and Induction Heads [5.123049926855312]
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences.
We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability.
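Induction heads are standardly scored by prefix matching: on a repeated random token sequence, an induction head at position t attends back to the token that followed the previous occurrence of the current token. A toy scorer over a precomputed attention pattern; the doubled-sequence setup and shapes are illustrative.

```python
import torch

def induction_score(attn: torch.Tensor, seq_len: int) -> float:
    """Prefix-matching score on a doubled random sequence.

    attn: [2*seq_len, 2*seq_len] attention pattern for one head on input
    [x_1..x_n, x_1..x_n]. An induction head attends from position t in the
    second copy back to position (t - seq_len) + 1: the token that followed
    the same token last time.
    """
    second_half = list(range(seq_len, 2 * seq_len))
    targets = [t - seq_len + 1 for t in second_half]
    return float(attn[second_half, targets].mean())

# Toy check: a perfect induction pattern scores 1; uniform attention ~1/(2n).
n = 8
perfect = torch.zeros(2 * n, 2 * n)
for t in range(n, 2 * n):
    perfect[t, t - n + 1] = 1.0
uniform = torch.full((2 * n, 2 * n), 1.0 / (2 * n))
print(induction_score(perfect, n), induction_score(uniform, n))  # 1.0 0.0625
```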
arXiv Detail & Related papers (2022-09-24T00:43:19Z)
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation [22.784419165117512]
Multi-head attention, which runs multiple attention mechanisms in parallel, improves the performance of the Transformer model across various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss function that prevents the model from assigning the same score to all heads, which helps identify the more important heads and improves performance.
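DHICM's exact formulation is not in the abstract; the sketch below is a hedged illustration of the stated idea: compute input-dependent per-head importance scores and add a loss term that discourages assigning the same score to every head (here, an entropy penalty on the softmax weights). All names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class DynamicHeadImportance(nn.Module):
    """Illustrative input-dependent head weighting: a small scorer maps the
    pooled input to one importance weight per head. A sketch of the idea,
    not DHICM's exact design."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, n_heads)

    def forward(self, x: torch.Tensor, head_outputs: torch.Tensor):
        # x: [batch, seq, d_model]; head_outputs: [batch, n_heads, seq, d_head]
        weights = torch.softmax(self.scorer(x.mean(dim=1)), dim=-1)  # [batch, n_heads]
        weighted = head_outputs * weights[:, :, None, None]
        # Entropy penalty: uniform weights maximize entropy, so adding this
        # term to the task loss discourages giving every head the same score.
        entropy_penalty = -(weights * torch.log(weights + 1e-9)).sum(-1).mean()
        return weighted, entropy_penalty

m = DynamicHeadImportance(d_model=32, n_heads=4)
out, penalty = m(torch.randn(2, 8, 32), torch.randn(2, 4, 8, 8))
print(out.shape, float(penalty))  # total loss = task_loss + lambda * penalty
```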
arXiv Detail & Related papers (2021-08-03T09:16:55Z)
- Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs [57.74359320513427]
Methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI.
We study the differences between the two main categories of these methods, and show how they can be unified under a single theoretical framework.
We conduct controlled experiments to discern the empirical differences between five V&L BERTs.
arXiv Detail & Related papers (2020-11-30T18:55:24Z)