Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
- URL: http://arxiv.org/abs/2402.04161v2
- Date: Mon, 21 Jul 2025 14:23:40 GMT
- Title: Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
- Authors: Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar
- Abstract summary: We introduce a new framework for a principled analysis of transformers via Markov chains. We show the existence of global minima (bigram) and bad local minima (unigram) contingent on data properties and model architecture.
- Score: 45.84704083061562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention-based transformers have achieved tremendous success across a variety of disciplines including natural language processing. To deepen our understanding of their sequential modeling capabilities, there is a growing interest in using Markov input processes to study them. A key finding is that when trained on first-order Markov chains, transformers with two or more layers consistently develop an induction head mechanism to estimate the in-context bigram conditional distribution. In contrast, single-layer transformers, unable to form an induction head, directly learn the Markov kernel but often face a surprising challenge: they become trapped in local minima representing the unigram distribution, whereas deeper models reliably converge to the ground-truth bigram. While single-layer transformers can theoretically model first-order Markov chains, their empirical failure to learn this simple kernel in practice remains a curious phenomenon. To explain this contrasting behavior of single-layer models, in this paper we introduce a new framework for a principled analysis of transformers via Markov chains. Leveraging our framework, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima (bigram) and bad local minima (unigram) contingent on data properties and model architecture. We precisely delineate the regimes under which these local optima occur. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results. Finally, we outline several open problems in this arena. Code is available at https://github.com/Bond1995/Markov.
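As a concrete illustration of the unigram versus bigram gap described above, the following NumPy sketch samples a binary first-order Markov chain and compares the next-token cross-entropy of the stationary (unigram) predictor against the true Markov kernel (bigram) predictor. This is an illustrative sketch only, not the authors' code (which lives in the linked repository); the binary alphabet, the switching probabilities p and q, and the function names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: binary first-order Markov source with switching
# probabilities p = P(1 | 0) and q = P(0 | 1).
p, q = 0.2, 0.3
P = np.array([[1 - p, p],
              [q, 1 - q]])            # Markov kernel: P[i, j] = P(next = j | current = i)
pi = np.array([q, p]) / (p + q)       # stationary (unigram / marginal) distribution

def sample_chain(length):
    """Sample one sequence from the chain, started at stationarity."""
    x = np.empty(length, dtype=int)
    x[0] = rng.choice(2, p=pi)
    for t in range(1, length):
        x[t] = rng.choice(2, p=P[x[t - 1]])
    return x

def cross_entropy(seq, predict):
    """Average next-token cross-entropy of a predictor on one sequence."""
    probs = [predict(seq[:t])[seq[t]] for t in range(1, len(seq))]
    return -np.mean(np.log(probs))

unigram = lambda prefix: pi             # the "bad local minimum" predictor
bigram = lambda prefix: P[prefix[-1]]   # the ground-truth kernel (global minimum)

seq = sample_chain(10_000)
print("unigram loss:", cross_entropy(seq, unigram))  # approaches H(pi)
print("bigram loss: ", cross_entropy(seq, bigram))   # approaches the entropy rate, which is lower
```

The gap between the two printed losses is the separation the abstract attributes to the unigram local minima versus the bigram global minima; for this parameterization it closes only when the chain is i.i.d., i.e. when p + q = 1.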
Related papers
- Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models [12.112842686827669]
Transformer-based models have shown remarkable capabilities in sequence learning across a wide range of tasks. We investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability. Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency on sequence learning across diverse tasks.
arXiv Detail & Related papers (2025-06-02T17:39:31Z) - Enhancing Transformer with GNN Structural Knowledge via Distillation: A Novel Approach [1.4582633500696451]
This paper proposes a novel knowledge distillation framework that transfers multiscale structural knowledge from GNN teacher models to Transformer student models.
The framework effectively bridges the architectural gap between GNNs and Transformers through micro-macro distillation losses and multiscale feature alignment.
arXiv Detail & Related papers (2025-02-27T05:14:47Z) - Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Transformers on Markov Data: Constant Depth Suffices [25.83132046480226]
We study the behavior of transformers on data drawn from $k^{\text{th}}$-order Markov processes.
We find that a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from $k^{\text{th}}$-order Markov sources.
arXiv Detail & Related papers (2024-07-25T01:07:09Z) - Local to Global: Learning Dynamics and Effect of Initialization for Transformers [20.02103237675619]
We focus on first-order Markov chains and single-layer transformers.
We prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima.
arXiv Detail & Related papers (2024-06-05T08:57:41Z) - Toward a Theory of Tokenization in LLMs [26.516041872337887]
We study tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. We show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near optimally.
arXiv Detail & Related papers (2024-04-12T09:01:14Z) - From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers [41.82477691012942]
We study learning a 1-layer self-attention model from a set of prompts and associated output data.
We first establish a precise mapping between the self-attention mechanism and Markov models.
We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens.
arXiv Detail & Related papers (2024-02-21T03:51:34Z) - The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains [28.41876902994335]
We introduce a simple Markov Chain sequence modeling task to study how the in-context learning (ICL) capability emerges.
Transformers trained on this task form statistical induction heads which compute accurate next-token probabilities (a count-based sketch of this estimator appears after this list).
We show how successful learning results from the interaction between the transformer's layers, and uncover evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution.
arXiv Detail & Related papers (2024-02-16T18:28:36Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z) - Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning [53.73083199055093]
We show that attention-based architectures (e.g., Transformers) are fairly robust to distribution shifts.
Our experiments show that replacing convolutional networks with Transformers can greatly reduce catastrophic forgetting of previous devices.
arXiv Detail & Related papers (2021-06-10T21:04:18Z) - Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention.
We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z) - Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence (a simplified random-feature sketch appears after this list).
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
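Both the abstract above and several of the related entries (notably those on induction heads) describe two-layer transformers as estimating the in-context bigram conditional distribution, i.e. the distribution of the next token given the current one, computed from the context alone. The count-based estimator below is a minimal, transformer-free sketch of that statistic; the Laplace-style smoothing constant and the function name are assumptions made for illustration.

```python
import numpy as np

def in_context_bigram(context, vocab_size, alpha=1.0):
    """Estimate P(next token | current token) from the context alone:
    count how each earlier occurrence of the current token was continued.
    This is the statistic an induction head is described as computing."""
    counts = np.full(vocab_size, alpha)           # assumed Laplace smoothing
    cur = context[-1]
    for prev, nxt in zip(context[:-1], context[1:]):
        if prev == cur:
            counts[nxt] += 1
    return counts / counts.sum()

# Toy usage: every earlier 0 in the context was followed by 1, so the
# estimate for "next token after the final 0" concentrates on token 1.
ctx = [0, 1, 2, 0, 1, 1, 0, 1, 0]
print(in_context_bigram(ctx, vocab_size=3))       # ≈ [0.17, 0.67, 0.17]
```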
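The Performer entry claims linear rather than quadratic scaling in the number of tokens together with an unbiased estimate of the attention matrix. As a rough illustration of how random-feature attention achieves this, here is a minimal, non-causal NumPy sketch using positive random features for the softmax kernel; it is a simplified stand-in written for this summary, not the library's implementation, and the feature count m and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_attention(Q, K, V):
    """Exact attention: builds the full L x L score matrix (quadratic in L)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def favor_attention(Q, K, V, m=1024):
    """Random-feature attention: approximates softmax attention without
    ever forming the L x L matrix (linear in L for fixed m and d)."""
    d = Q.shape[-1]
    Qs, Ks = Q / d**0.25, K / d**0.25             # fold in the 1/sqrt(d) scaling
    W = rng.standard_normal((m, d))               # random projection directions
    def phi(X):                                   # positive random-feature map
        return np.exp(X @ W.T - 0.5 * (X**2).sum(-1, keepdims=True)) / np.sqrt(m)
    Qf, Kf = phi(Qs), phi(Ks)                     # (L, m) feature matrices
    num = Qf @ (Kf.T @ V)                         # numerator, O(L * m * d)
    den = Qf @ Kf.sum(axis=0)                     # per-row normalizers
    return num / den[:, None]

L, d = 128, 16
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
exact = softmax_attention(Q, K, V)
approx = favor_attention(Q, K, V)
print("max abs deviation:", np.abs(exact - approx).max())
```

Increasing m tightens the approximation, since each feature dimension contributes one Monte Carlo sample of the softmax kernel.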