Selective Induction Heads: How Transformers Select Causal Structures In Context
- URL: http://arxiv.org/abs/2509.08184v1
- Date: Tue, 09 Sep 2025 23:13:41 GMT
- Title: Selective Induction Heads: How Transformers Select Causal Structures In Context
- Authors: Francesco D'Angelo, Francesco Croce, Nicolas Flammarion
- Abstract summary: We introduce a novel framework that showcases transformers' ability to handle causal structures. Our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed. This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have exhibited exceptional capabilities in sequence modeling tasks, leveraging self-attention and in-context learning. Critical to this success are induction heads, attention circuits that enable copying tokens based on their previous occurrences. In this work, we introduce a novel framework that showcases transformers' ability to dynamically handle causal structures. Existing works rely on Markov Chains to study the formation of induction heads, revealing how transformers capture causal dependencies and learn transition probabilities in-context. However, they rely on a fixed causal structure that fails to capture the complexity of natural languages, where the relationship between tokens dynamically changes with context. To address this limitation, our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed. This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context. We empirically demonstrate that transformers learn this mechanism to predict the next token by identifying the correct lag and copying the corresponding token from the past. We provide a detailed construction of a 3-layer transformer to implement the selective induction head, and a theoretical analysis proving that this mechanism asymptotically converges to the maximum likelihood solution. Our findings advance the understanding of how transformers select causal structures, providing new insights into their functioning and interpretability.
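The mechanism the abstract describes can be sketched outside a transformer: sample a lag-k Markov chain with a fixed transition matrix, select the lag by maximum likelihood, then copy-predict from the token that many steps back. This is a minimal illustrative sketch, not the paper's construction; the vocabulary size, lag set, and transition matrix below are our own toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not from the paper): a 3-token vocabulary,
# one fixed transition matrix P, and a set of candidate lags.
# In a lag-k chain, x_t is drawn from the row P[x_{t-k}].
VOCAB = 3
LAGS = [1, 2, 3]
P = np.array([[0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90],
              [0.90, 0.05, 0.05]])  # strongly cyclic, so lags are easy to tell apart

def sample_chain(lag, length):
    """Sample a lag-`lag` Markov chain with fixed transitions P."""
    x = list(rng.integers(VOCAB, size=lag))          # random prefix
    for t in range(lag, length):
        x.append(rng.choice(VOCAB, p=P[x[t - lag]]))
    return x

def select_lag(x, lags=LAGS):
    """Maximum-likelihood lag selection: score each candidate lag by the
    log-likelihood of the observed transitions under the fixed matrix P."""
    start = max(lags)
    scores = {k: sum(np.log(P[x[t - k], x[t]]) for t in range(start, len(x)))
              for k in lags}
    return max(scores, key=scores.get)

def selective_induction_predict(x):
    """Mimic a selective induction head: identify the lag in-context,
    then predict the next token from the token that many steps back."""
    k = select_lag(x)
    return k, int(P[x[-k]].argmax())
```

With this near-deterministic P, `select_lag(sample_chain(2, 300))` recovers the true lag essentially always, matching the paper's point that the selection mechanism converges to the maximum-likelihood solution.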
Related papers
- What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains [64.31313691823088]
In-context learning (ICL) is a capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. We show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram.
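What "representing a conditional k-gram" computes can be sketched as an in-context counting estimator: tally, within the sequence itself, which token follows each length-k context, then predict the most frequent continuation of the final context. A minimal, library-free sketch (the function name and examples are ours, not from the paper):

```python
from collections import Counter, defaultdict

def kgram_predict(seq, k):
    """In-context conditional k-gram: count which token follows each
    length-k context in `seq`, then predict the most frequent
    continuation of the final context."""
    counts = defaultdict(Counter)
    for t in range(k, len(seq)):
        counts[tuple(seq[t - k:t])][seq[t]] += 1
    ctx = tuple(seq[-k:])
    if not counts[ctx]:
        return None  # final context never seen earlier in the sequence
    return counts[ctx].most_common(1)[0][0]
```

For example, on the alternating sequence `[0, 1, 0, 1, 0, 1, 0]`, `kgram_predict(seq, 1)` returns `1`, since the context `(0,)` was always followed by `1` in context.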
arXiv Detail & Related papers (2025-08-10T07:03:01Z) - Is Random Attention Sufficient for Sequence Modeling? Disentangling Trainable Components in the Transformer [15.196937229815445]
We show that attention with frozen key and query weights can perform competitively on language modeling. We also design MixiT, an architecture with entirely random attention scores, with provably stable signal propagation. Our results suggest that the transformer architecture has a built-in inductive bias towards forming specialized circuits.
arXiv Detail & Related papers (2025-06-01T18:42:39Z) - On the Robustness of Transformers against Context Hijacking for Linear Classification [26.1838836907147]
Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. They can be disrupted by factually correct context, a phenomenon known as context hijacking. We show that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations.
arXiv Detail & Related papers (2025-02-21T17:31:00Z) - An explainable transformer circuit for compositional generalization [4.446278061385101]
We identify and mechanistically interpret the circuit responsible for compositional induction in a compact transformer. Using causal ablations, we validate the circuit and formalize its operation using a program-like description. Our findings advance the understanding of complex behaviors in transformers and highlight that such insights can provide a direct pathway for model control.
arXiv Detail & Related papers (2025-02-19T02:30:41Z) - Enhancing Transformers for Generalizable First-Order Logical Entailment [51.04944136538266]
This paper studies the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge. We propose TEGA, a logic-aware architecture that significantly improves the performance in first-order logical entailment.
arXiv Detail & Related papers (2025-01-01T07:05:32Z) - Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training.
Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking.
Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z) - How Transformers Learn Causal Structure with Gradient Descent [44.31729147722701]
Self-attention allows transformers to encode causal structure.
We introduce an in-context learning task that requires learning latent causal structure.
We show that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
arXiv Detail & Related papers (2024-02-22T17:47:03Z) - What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
Regularization objective TokenProp is embraced in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed information and is not responsible for any consequences.