What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains
- URL: http://arxiv.org/abs/2508.07208v1
- Date: Sun, 10 Aug 2025 07:03:01 GMT
- Title: What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains
- Authors: Chanakya Ekbote, Marco Bondaschi, Nived Rajaraman, Jason D. Lee, Michael Gastpar, Ashok Vardhan Makkuva, Paul Pu Liang
- Abstract summary: In-context learning (ICL) is a capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. We show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram.
- Score: 64.31313691823088
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.
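To make the target of the main result concrete, here is a minimal sketch (our illustration, not the authors' transformer construction) of the conditional k-gram map that an order-k induction head computes: scan the context for earlier occurrences of the current length-k suffix and output the empirical distribution of the tokens that followed. The function name `kgram_predict` is ours.

```python
from collections import Counter

def kgram_predict(seq, k):
    """Conditional k-gram estimator: the map the paper shows a two-layer,
    single-head transformer can represent. Returns the empirical
    distribution of the next token, conditioned on the current length-k
    suffix, using only in-context counts."""
    assert len(seq) >= k
    suffix = tuple(seq[-k:])
    counts = Counter(
        seq[i + k]                      # token that followed the match
        for i in range(len(seq) - k)    # all earlier length-k windows
        if tuple(seq[i:i + k]) == suffix
    )
    total = sum(counts.values())
    if total == 0:                      # suffix never seen in context
        return {}
    return {tok: c / total for tok, c in counts.items()}

# Example: with k=2, the suffix (1, 0) previously occurred once,
# followed by token 2.
print(kgram_predict([0, 1, 0, 2, 1, 0], k=2))  # {2: 1.0}
```

The sketch states only the target function; the paper's contribution is an explicit two-layer attention construction realizing it for any k.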
Related papers
- Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers [0.10152838128195467]
We train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification task (a toy encoding is sketched below). A single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking residuals and normalization layers. A two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions.
arXiv Detail & Related papers (2025-10-28T22:25:19Z)
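The summary above leaves the symbolic encoding unspecified; one plausible toy version (our assumption, not the paper's exact format) has the subject name appear twice and the indirect object once, with the single-occurrence name as the target:

```python
import random

NAMES = list(range(10))   # symbolic "names" are just token ids

def sample_ioi():
    """One symbolic IOI example: subject s appears twice, indirect
    object io once; the correct completion is io."""
    io, s = random.sample(NAMES, 2)
    prompt = [io, s, s]    # cf. "Mary and John ... John gave to ->"
    return prompt, io      # target: the name that occurred once

prompt, target = sample_ioi()
assert target in prompt and prompt.count(target) == 1
```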
- Selective Induction Heads: How Transformers Select Causal Structures In Context [50.09964990342878]
We introduce a novel framework that showcases transformers' ability to handle causal structures. Our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed (a data-generation sketch follows below). This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context.
arXiv Detail & Related papers (2025-09-09T23:13:41Z)
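A sketch of the data-generating process described above, under our reading of the setup: a single fixed transition kernel is shared across all sequences, while the lag determining which past token is the causal parent varies. Names like `sample_sequence` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
V = 3                                   # vocabulary size
P = rng.dirichlet(np.ones(V), size=V)   # one FIXED transition kernel

def sample_sequence(lag, T=64):
    """Markov chain of order `lag`: token t depends only on token
    t - lag, always through the same kernel P. Varying `lag` changes
    the causal structure while transition probabilities stay fixed."""
    x = list(rng.integers(0, V, size=lag))        # random prefix
    for t in range(lag, T):
        x.append(rng.choice(V, p=P[x[t - lag]]))
    return x

# Two sequences sharing P but with different causal lags; a selective
# induction head must identify the correct lag from context alone.
seq_lag1, seq_lag3 = sample_sequence(lag=1), sample_sequence(lag=3)
```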
- How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias [48.9399496805422]
We focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check' (toy labelers are sketched below). Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks.
arXiv Detail & Related papers (2025-05-02T00:07:35Z)
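For concreteness, toy labelers for the two tasks under commonly used definitions (an assumption on our part; the paper's exact specifications may differ): 'parity check' asks whether a binary string contains an even number of ones, and 'even pairs' asks whether the number of adjacent 01/10 switches is even, which holds exactly when the first and last bits agree.

```python
def parity_check(bits):
    # 1 iff the string contains an even number of ones
    return int(sum(bits) % 2 == 0)

def even_pairs(bits):
    # 1 iff the number of adjacent 01/10 switches is even,
    # equivalently iff the first and last bits agree
    switches = sum(a != b for a, b in zip(bits, bits[1:]))
    return int(switches % 2 == 0)

assert even_pairs([0, 1, 1, 0]) == 1    # two switches, first == last
assert parity_check([1, 0, 1, 1]) == 0  # three ones -> odd
```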
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
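For reference, the 'gradient flow' in the convergence statement above is the continuous-time limit of gradient descent; written against a generic cross-entropy ICL loss in our own notation (the paper's exact loss may differ):

```latex
% Continuous-time training dynamics on the transformer parameters \theta:
\frac{\mathrm{d}\theta(t)}{\mathrm{d}t}
  = -\nabla_{\theta}\,\mathcal{L}\bigl(\theta(t)\bigr),
\qquad
\mathcal{L}(\theta)
  = \mathbb{E}\left[-\log p_{\theta}\left(x_{T+1} \mid x_{1:T}\right)\right],
% with the expectation over n-gram Markov chain sequences x_{1:T+1};
% the cited result is that \theta(t) converges to a limiting model.
```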
- Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains [45.84704083061562]
We introduce a new framework for a principled analysis of transformers via Markov chains. We show the existence of global minima (bigram) and bad local minima (unigram) contingent on data properties and model architecture (see the identity below).
arXiv Detail & Related papers (2024-02-06T17:18:59Z)
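A standard information-theoretic identity (added here for context, not taken from the paper) makes precise why the bigram predictor dominates the unigram one on first-order Markov data:

```latex
% Stationary first-order Markov source with marginal \pi and kernel P.
% Expected next-token cross-entropy of the two predictors:
\mathcal{L}_{\mathrm{unigram}} - \mathcal{L}_{\mathrm{bigram}}
  = H(X_{t+1}) - H(X_{t+1} \mid X_t)
  = I(X_t ; X_{t+1}) \ \ge\ 0,
% with equality iff consecutive symbols are independent, so the bigram
% solution is strictly better whenever the chain has genuine memory.
```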
- Towards Understanding How Transformers Learn In-context Through a Representation Learning Lens [9.590540796223715]
In this paper, we attempt to explore the in-context learning process in Transformers through a lens of representation learning.
The ICL inference process of the attention layer aligns with the training procedure of its dual model, generating token representation predictions.
We extend our theoretical conclusions to more complicated scenarios, including one Transformer layer and multiple attention layers.
arXiv Detail & Related papers (2023-10-20T01:55:34Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
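A sketch of the compositional task structure described above, under our reading: labels depend on the input only through a fixed representation, and only a linear head varies across tasks. The names `phi` and `sample_task_prompt` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n_ctx = 4, 8, 16
A = rng.normal(size=(m, d))              # fixed across ALL tasks

def phi(x):
    # the fixed, possibly complex representation shared by every task
    return np.tanh(A @ x)

def sample_task_prompt():
    """One synthetic in-context task: the label is linear in phi(x),
    and only the head w changes from task to task."""
    w = rng.normal(size=m)               # task-specific component
    xs = rng.normal(size=(n_ctx, d))
    ys = np.array([w @ phi(x) for x in xs])
    x_query = rng.normal(size=d)
    return xs, ys, x_query, w @ phi(x_query)

xs, ys, x_query, target = sample_task_prompt()   # prompt + gold label
```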
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
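As a concrete instance of 'implementing a standard machine learning algorithm in context', here is what in-context ridge regression means as an input-output map (our sketch of the target behavior, not the paper's transformer construction; `icl_ridge` is an illustrative name):

```python
import numpy as np

def icl_ridge(xs, ys, x_query, lam=0.1):
    """Fit ridge regression on the (x_i, y_i) pairs appearing in the
    prompt, then predict at the query point; the cited claim is that a
    transformer's forward pass can approximate this map."""
    X = np.asarray(xs)                   # (n, d) context inputs
    y = np.asarray(ys)                   # (n,)  context labels
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return x_query @ w

rng = np.random.default_rng(0)
w_true = rng.normal(size=3)
X = rng.normal(size=(32, 3))
y = X @ w_true
x_query = rng.normal(size=3)
print(icl_ridge(X, y, x_query))          # close to w_true @ x_query
```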