In-Context Learning Without Copying
- URL: http://arxiv.org/abs/2511.05743v1
- Date: Fri, 07 Nov 2025 22:11:11 GMT
- Title: In-Context Learning Without Copying
- Authors: Kerem Sahin, Sheridan Feucht, Adam Belfki, Jannik Brinkmann, Aaron Mueller, David Bau, Chris Wendler
- Abstract summary: We study whether transformers can still acquire in-context learning capabilities when inductive copying is suppressed. We propose Hapax, a setting where we omit the loss contribution of any token that can be correctly predicted by induction heads. Mechanistic analysis shows that models trained with Hapax develop fewer and weaker induction heads but still preserve ICL capabilities.
- Score: 31.718993147344353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Induction heads are attention heads that perform inductive copying by matching patterns from earlier context and copying their continuations verbatim. As models develop induction heads, they often experience a sharp drop in training loss, a phenomenon cited as evidence that induction heads may serve as a prerequisite for more complex in-context learning (ICL) capabilities. In this work, we ask whether transformers can still acquire ICL capabilities when inductive copying is suppressed. We propose Hapax, a setting where we omit the loss contribution of any token that can be correctly predicted by induction heads. Despite a significant reduction in inductive copying, performance on abstractive ICL tasks (i.e., tasks where the answer is not contained in the input context) remains comparable and surpasses the vanilla model on 13 of 21 tasks, even though 31.7% of tokens are omitted from the loss. Furthermore, our model achieves lower loss values on token positions that cannot be predicted correctly by induction heads. Mechanistic analysis further shows that models trained with Hapax develop fewer and weaker induction heads but still preserve ICL capabilities. Taken together, our findings indicate that inductive copying is not essential for learning abstractive ICL mechanisms.
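The core idea of Hapax, as described in the abstract, is to omit the loss contribution of tokens that induction heads could predict by copying. A minimal sketch of this masking step is below; note that the repeated-bigram test used here is only a simple proxy for "predictable by an induction head" (a real induction head matches an earlier occurrence of the current token and copies its continuation), and the paper's actual criterion and function names are not specified in this listing.

```python
def induction_predictable_mask(tokens):
    """For each position t, return True if the bigram
    (tokens[t-1], tokens[t]) already occurred earlier in the
    sequence -- i.e., an induction-style matcher could have found
    tokens[t-1] earlier in context and copied its continuation."""
    seen = set()
    mask = [False] * len(tokens)
    for t in range(1, len(tokens)):
        bigram = (tokens[t - 1], tokens[t])
        if bigram in seen:
            mask[t] = True
        seen.add(bigram)
    return mask


def hapax_loss(per_token_losses, tokens):
    """Average the per-token loss only over positions NOT flagged as
    induction-predictable, omitting the copyable tokens from the
    training objective."""
    mask = induction_predictable_mask(tokens)
    kept = [loss for loss, m in zip(per_token_losses, mask) if not m]
    return sum(kept) / len(kept) if kept else 0.0
```

For example, in the sequence `A B C A B`, the final `B` completes a bigram (`A`, `B`) already seen at the start, so its loss would be dropped; in an actual training loop the per-token losses would come from something like a cross-entropy with `reduction='none'`.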
Related papers
- On the Emergence of Induction Heads for In-Context Learning [121.64612469118464]
We study the emergence of induction heads, a previously identified mechanism in two-layer transformers. We explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture.
arXiv Detail & Related papers (2025-11-02T18:12:06Z) - What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples? [6.886536285117155]
We introduce Bi-Induct, a lightweight curriculum that injects forward-copy (Induction), backward-copy (Anti), or a balanced mix into the pretraining stream. We train models from 0.13B to 1B parameters under iso-FLOPs, evaluating (i) few-shot ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language modeling perplexity.
arXiv Detail & Related papers (2025-09-26T21:27:55Z) - Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models [24.666925550391024]
We identify induction heads as a key driver of the repetition curse. We propose an attention-head regularization technique that can be employed to reduce the dominance of induction heads during generation.
arXiv Detail & Related papers (2025-05-17T03:09:33Z) - Mitigating Copy Bias in In-Context Learning through Neuron Pruning [74.91243772654519]
Large language models (LLMs) have demonstrated impressive few-shot in-context learning abilities.
They are sometimes prone to a 'copying bias', where they copy answers from provided examples instead of learning the underlying patterns.
We propose a novel and simple method to mitigate such copying bias.
arXiv Detail & Related papers (2024-10-02T07:18:16Z) - On the Inductive Bias of Stacking Towards Improving Reasoning [50.225873619537765]
We propose a variant of gradual stacking called MIDAS that can speed up language model training by up to 40%.
MIDAS is not only training-efficient but surprisingly also has an inductive bias towards improving downstream tasks.
We conjecture the underlying reason for this inductive bias by exploring the connection of stacking to looped models.
arXiv Detail & Related papers (2024-09-27T17:58:21Z) - Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning [12.911829891263263]
We show that even a minimal ablation of induction heads leads to ICL performance decreases of up to 32% for abstract pattern recognition tasks. For NLP tasks, this ablation substantially decreases the model's ability to benefit from examples, bringing few-shot ICL performance close to that of zero-shot prompts.
arXiv Detail & Related papers (2024-07-09T16:29:21Z) - Tripod: Three Complementary Inductive Biases for Disentangled Representation Learning [52.70210390424605]
In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature.
In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits.
We propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives.
The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks.
arXiv Detail & Related papers (2024-04-16T04:52:41Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of ''student'' models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - In-context Learning and Induction Heads [5.123049926855312]
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences.
We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability.
arXiv Detail & Related papers (2022-09-24T00:43:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.