Related papers: Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

URL: http://arxiv.org/abs/2402.04161v1
Date: Tue, 6 Feb 2024 17:18:59 GMT
Title: Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
Authors: Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar
Abstract summary: We study the sequential modeling capabilities of transformers through the lens of Markov chains. Inspired by the Markovianity of natural languages, we model the data as a Markovian source. We show the existence of global minima and bad local minima contingent upon the specific data characteristics and the transformer architecture.
Score: 48.146073732531605
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In recent years, attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. A key ingredient behind their success is the generative pretraining procedure, during which these models are trained on a large text corpus in an auto-regressive manner. To shed light on this phenomenon, we propose a new framework that allows both theory and systematic experiments to study the sequential modeling capabilities of transformers through the lens of Markov chains. Inspired by the Markovianity of natural languages, we model the data as a Markovian source and utilize this framework to systematically study the interplay between the data-distributional properties, the transformer architecture, the learnt distribution, and the final model performance. In particular, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima and bad local minima contingent upon the specific data characteristics and the transformer architecture. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results. We further investigate these findings in the broader context of higher order Markov chains and deeper architectures, and outline open problems in this arena. Code is available at \url{https://github.com/Bond1995/Markov}.

Related papers

Incremental Learning of Sparse Attention Patterns in Transformers [29.54151079577767]
This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions.<n>We identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns.
arXiv Detail & Related papers (2026-02-22T12:16:06Z)
Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems [14.98480544580102]
We first propose a mathematical construct to represent multi-modal, multi-scale data.<n>We then mathematically derive the neural attention mechanics for the proposed construct from the first principle of entropy minimization.<n>We show that the derived formulation is optimal in the sense of being the closest to the standard Softmax attention.
arXiv Detail & Related papers (2025-09-18T21:44:07Z)
Selective Induction Heads: How Transformers Select Causal Structures In Context [50.09964990342878]
We introduce a novel framework that showcases transformers' ability to handle causal structures.<n>Our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed.<n>This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context.
arXiv Detail & Related papers (2025-09-09T23:13:41Z)
What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains [64.31313691823088]
In-context learning (ICL) is a capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context.<n>We show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram.
arXiv Detail & Related papers (2025-08-10T07:03:01Z)
Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models [12.112842686827669]
Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks.<n>We investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability.<n>Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency on sequence learning across diverse tasks.
arXiv Detail & Related papers (2025-06-02T17:39:31Z)
Enhancing Transformer with GNN Structural Knowledge via Distillation: A Novel Approach [1.4582633500696451]
This paper proposes a novel knowledge distillation framework that transfers multiscale structural knowledge from GNN teacher models to Transformer student models. The framework effectively bridges the architectural gap between GNNs and Transformers through micro-macro distillation losses and multiscale feature alignment.
arXiv Detail & Related papers (2025-02-27T05:14:47Z)
Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model. Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z)
Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
Transformers on Markov Data: Constant Depth Suffices [25.83132046480226]
We study the behavior of transformers on data drawn from kth Markov processes. We find that a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from kth Markov sources.
arXiv Detail & Related papers (2024-07-25T01:07:09Z)
Local to Global: Learning Dynamics and Effect of Initialization for Transformers [20.02103237675619]
We focus on first-order Markov chains and single-layer transformers. We prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima.
arXiv Detail & Related papers (2024-06-05T08:57:41Z)
Toward a Theory of Tokenization in LLMs [26.516041872337887]
We study tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes.<n>We show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $ktextth$-order Markov sources near optimally.
arXiv Detail & Related papers (2024-04-12T09:01:14Z)
From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers [41.82477691012942]
We study learning a 1-layer self-attention model from a set of prompts and associated output data. We first establish a precise mapping between the self-attention mechanism and Markov models. We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens.
arXiv Detail & Related papers (2024-02-21T03:51:34Z)
The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains [28.41876902994335]
We introduce a simple Markov Chain sequence modeling task to study how this in-context learning (ICL) capability emerges. Transformers trained on this task form emphstatistical induction heads which compute accurate next-token probabilities. We show how successful learning results from the interaction between the transformer's layers, and uncover evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution.
arXiv Detail & Related papers (2024-02-16T18:28:36Z)
On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting. Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent. For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently over-smoothing phenomenon of Transformer-based models is observed in both vision and language fields. We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models. We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z)
Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning [53.73083199055093]
We show that attention-based architectures (e.g., Transformers) are fairly robust to distribution shifts. Our experiments show that replacing convolutional networks with Transformers can greatly reduce catastrophic forgetting of previous devices.
arXiv Detail & Related papers (2021-06-10T21:04:18Z)
Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers. This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z)
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR) Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors. It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.