Constrained belief updates explain geometric structures in transformer representations
- URL: http://arxiv.org/abs/2502.01954v1
- Date: Tue, 04 Feb 2025 03:03:54 GMT
- Title: Constrained belief updates explain geometric structures in transformer representations
- Authors: Mateusz Piotrowski, Paul M. Riechers, Daniel Filan, Adam S. Shai
- Abstract summary: We integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models. We find that attention heads carry out an algorithm with a natural interpretation in the probability simplex, and create representations with distinctive geometric structure.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: What computational structures emerge in transformers trained on next-token prediction? In this work, we provide evidence that transformers implement constrained Bayesian belief updating -- a parallelized version of partial Bayesian inference shaped by architectural constraints. To do this, we integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models that generate rich geometric patterns in neural activations. We find that attention heads carry out an algorithm with a natural interpretation in the probability simplex, and create representations with distinctive geometric structure. We show how both the algorithmic behavior and the underlying geometry of these representations can be theoretically predicted in detail -- including the attention pattern, OV-vectors, and embedding vectors -- by modifying the equations for optimal future token predictions to account for the architectural constraints of attention. Our approach provides a principled lens on how gradient descent resolves the tension between optimal prediction and architectural design.
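As a concrete illustration of the belief updating in question, the sketch below runs exact Bayesian filtering on a small HMM with token-labeled transition matrices. The 3-state, 2-token process is an arbitrary stand-in, not the paper's specific model family, and a uniform prior replaces the stationary distribution for simplicity.

```python
import numpy as np

# Hypothetical 3-state, 2-token HMM: token-labeled substochastic
# matrices T[x] whose sum over tokens is row-stochastic.
rng = np.random.default_rng(0)
n_states, n_tokens = 3, 2
raw = rng.random((n_tokens, n_states, n_states))
T = raw / raw.sum(axis=(0, 2), keepdims=True)  # sum_x T[x] is row-stochastic

def update_belief(eta, x):
    """Exact Bayesian filter step: eta' is proportional to eta @ T[x]."""
    unnorm = eta @ T[x]
    return unnorm / unnorm.sum()

eta = np.full(n_states, 1.0 / n_states)  # uniform prior, for simplicity
for x in [0, 1, 1, 0]:
    eta = update_belief(eta, x)
    print(eta)  # each belief is a point in the probability simplex

# Optimal next-token probabilities follow from the current belief state.
p_next = np.array([(eta @ T[x]).sum() for x in range(n_tokens)])
print(p_next, p_next.sum())  # a valid distribution over tokens
```

Tracing these belief states over many contexts is what produces the geometric patterns in the simplex that the paper compares against transformer activations.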
Related papers
- Universal Approximation Theorem for a Single-Layer Transformer [0.0]
Deep learning employs multi-layer neural networks trained via the backpropagation algorithm.
Transformers have achieved state-of-the-art performance in natural language processing.
We prove that a single-layer Transformer, comprising one self-attention layer followed by a position-wise feed-forward network with ReLU activation, can approximate any continuous sequence-to-sequence mapping on a compact domain to arbitrary precision.
arXiv Detail & Related papers (2025-07-11T11:37:39Z)
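For concreteness, the single-layer architecture the theorem concerns (one self-attention layer followed by a position-wise ReLU feed-forward network) fits in a few lines. The sketch below uses random weights purely to illustrate the function class, not the approximation construction from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_ff, seq_len = 8, 32, 5
X = rng.normal(size=(seq_len, d))  # input sequence of d-dim tokens

# Random weights stand in for the trained parameters the theorem constructs.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)
    e = np.exp(A)
    return e / e.sum(axis=axis, keepdims=True)

# One self-attention layer ...
scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
attn_out = softmax(scores) @ (X @ Wv)
# ... followed by a position-wise feed-forward network with ReLU.
Y = np.maximum(attn_out @ W1 + b1, 0) @ W2 + b2
print(Y.shape)  # (seq_len, d): a sequence-to-sequence map
```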
- Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths.
Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope.
We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps.
This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models.
arXiv Detail & Related papers (2025-06-28T01:46:36Z)
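The underlying LMC measurement is easy to state in code: evaluate the loss along the straight line between two parameter vectors and read off the barrier height. The sketch below uses a toy analytic loss as a stand-in for a trained transformer's loss surface; the paper's contribution lies in the symmetry maps that align models before interpolating.

```python
import numpy as np

# Toy stand-in loss with two basins; in practice this would be the
# training loss of a network evaluated at interpolated parameters.
def loss(theta):
    return np.sum((theta - 1.0) ** 2) * np.sum((theta + 1.0) ** 2) / len(theta)

rng = np.random.default_rng(2)
theta_a, theta_b = rng.normal(size=16), rng.normal(size=16)

# Loss along the linear path theta(t) = (1 - t) * theta_a + t * theta_b.
ts = np.linspace(0.0, 1.0, 11)
path = [loss((1 - t) * theta_a + t * theta_b) for t in ts]
barrier = max(path) - max(path[0], path[-1])  # height of the loss barrier
print(f"barrier: {barrier:.3f}")  # near zero means linear mode connectivity
```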
- Dual Filter: A Mathematical Framework for Inference using Transformer-like Architectures [1.9567015559455132]
We present a framework for causal nonlinear prediction in settings where observations are generated from an underlying hidden Markov model (HMM).
Both the problem formulation and the proposed solution are motivated by the decoder-only transformer architecture.
arXiv Detail & Related papers (2025-05-01T19:19:29Z)
- Random Sparse Lifts: Construction, Analysis and Convergence of finite sparse networks [17.487761710665968]
We present a framework to define a large class of neural networks for which, by construction, training by gradient flow provably reaches arbitrarily low loss when the number of parameters grows.
arXiv Detail & Related papers (2025-01-10T12:52:00Z)
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
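The ICL task in that paper has a compact generative form: each context instantiates a fresh affine recurrence that the model must identify in order to predict the next element. A sketch of the data-generating process, with arbitrary dimensions and parameter ranges chosen here for illustration:

```python
import numpy as np

def sample_affine_recurrence(rng, dim=2, length=8):
    """One in-context episode: x_{t+1} = A @ x_t + b with fresh A, b."""
    A = rng.uniform(-0.5, 0.5, size=(dim, dim))
    b = rng.uniform(-1.0, 1.0, size=dim)
    xs = [rng.normal(size=dim)]
    for _ in range(length - 1):
        xs.append(A @ xs[-1] + b)
    # The model must infer A and b from context to predict the next element.
    return np.stack(xs)

rng = np.random.default_rng(3)
episode = sample_affine_recurrence(rng)
print(episode.shape)  # (8, 2): one sequence per freshly sampled recurrence
```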
- Transformers represent belief state geometry in their residual stream [40.803656512527645]
We present evidence that the geometric structure of transformer activations is given by the meta-dynamics of belief updating over hidden states of the data-generating process.
Our work provides a general framework connecting the structure of training data to the geometric structure of activations inside transformers.
arXiv Detail & Related papers (2024-05-24T21:14:10Z)
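The headline measurement of that predecessor paper is a linear (affine) regression from residual-stream activations onto ground-truth belief states. The sketch below uses synthetic stand-ins for both quantities (`resid` and `beliefs` are hypothetical arrays) to show the probe itself; in the actual experiments the activations come from a trained transformer.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_model, n_states = 500, 64, 3

# Synthetic stand-ins: in the real experiments, `resid` holds residual-stream
# activations and `beliefs` the Bayesian belief states for each context.
beliefs = rng.dirichlet(np.ones(n_states), size=n)
proj = rng.normal(size=(n_states, d_model))
resid = beliefs @ proj + 0.01 * rng.normal(size=(n, d_model))

# Affine least-squares probe from activations to belief coordinates.
X = np.hstack([resid, np.ones((n, 1))])
W, *_ = np.linalg.lstsq(X, beliefs, rcond=None)
pred = X @ W
r2 = 1 - np.sum((pred - beliefs) ** 2) / np.sum((beliefs - beliefs.mean(0)) ** 2)
print(f"probe R^2: {r2:.4f}")  # near 1 when belief geometry is linearly embedded
```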
- A Theory of Topological Derivatives for Inverse Rendering of Geometry [87.49881303178061]
We introduce a theoretical framework for differentiable surface evolution that allows discrete topology changes through the use of topological derivatives.
We validate the proposed theory by optimizing closed curves in 2D and surfaces in 3D, lending insight into the limitations of current methods.
arXiv Detail & Related papers (2023-08-19T00:55:55Z)
- Approximation Rate of the Transformer Architecture for Sequence Modeling [18.166959969957315]
We consider a class of non-linear relationships and identify a novel notion of complexity measures to establish an explicit Jackson-type approximation rate estimate for the Transformer.
This rate reveals the structural properties of the Transformer and suggests the types of sequential relationships it is best suited for approximating.
arXiv Detail & Related papers (2023-05-29T10:56:36Z)
- BayesFormer: Transformer with Uncertainty Estimation [31.206243748162553]
We introduce BayesFormer, a Transformer model whose dropout layers are designed according to Bayesian theory.
We show improvements across the board: language modeling and classification, long-sequence understanding, machine translation, and acquisition functions for active learning.
arXiv Detail & Related papers (2022-06-02T01:54:58Z)
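The general dropout-as-approximate-Bayesian-inference idea can be illustrated with standard Monte-Carlo dropout: keep dropout active at inference and read predictive uncertainty from the spread of repeated stochastic forward passes. The sketch below uses a generic MLP, not BayesFormer's actual architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 1),
)

x = torch.randn(8, 16)
model.train()  # keep dropout stochastic at inference time (MC dropout)
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(100)])
mean, std = samples.mean(dim=0), samples.std(dim=0)
print(mean.shape, std.shape)  # std gives a per-input predictive uncertainty
```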
- Geometric Methods for Sampling, Optimisation, Inference and Adaptive Agents [102.42623636238399]
We identify fundamental geometric structures that underlie the problems of sampling, optimisation, inference and adaptive decision-making.
We derive algorithms that exploit these geometric structures to solve these problems efficiently.
arXiv Detail & Related papers (2022-03-20T16:23:17Z)
- Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
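The core idea behind FAVOR is to approximate the softmax kernel exp(q·k) with positive random features so attention can be computed in linear time in sequence length. The sketch below uses i.i.d. Gaussian features rather than the orthogonal features FAVOR prescribes, and folds the usual 1/sqrt(d) scaling into Q and K; it is an illustration of the estimator, not the Performer implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
seq_len, d, m = 6, 8, 256  # m = number of random features

Q = rng.normal(size=(seq_len, d)) / d ** 0.25  # scaling folded into Q, K
K = rng.normal(size=(seq_len, d)) / d ** 0.25
V = rng.normal(size=(seq_len, d))

def phi(X, W):
    """Positive random features: exp(x·w - ||x||^2 / 2) / sqrt(m), so that
    E[phi(q) · phi(k)] = exp(q · k), the unnormalized softmax kernel."""
    return np.exp(X @ W - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(W.shape[1])

W = rng.normal(size=(d, m))
Qp, Kp = phi(Q, W), phi(K, W)

# Linear-time attention: never materializes the seq_len x seq_len matrix.
num = Qp @ (Kp.T @ V)        # O(seq_len * m * d)
den = Qp @ Kp.sum(axis=0)    # row normalizers
approx = num / den[:, None]

# Exact softmax attention, for comparison.
S = np.exp(Q @ K.T)
exact = (S @ V) / S.sum(axis=1, keepdims=True)
print(np.abs(approx - exact).max())  # shrinks as m grows
```

Orthogonalizing the rows of W, as FAVOR does, reduces the variance of this estimator without biasing it.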
- A General Framework for Consistent Structured Prediction with Implicit Loss Embeddings [113.15416137912399]
We propose and analyze a novel theoretical and algorithmic framework for structured prediction.
We study a large class of loss functions that implicitly defines a suitable geometry on the problem.
When dealing with output spaces with infinite cardinality, a suitable implicit formulation of the estimator is shown to be crucial.
arXiv Detail & Related papers (2020-02-13T10:30:04Z)
- Understanding Graph Neural Networks with Generalized Geometric Scattering Transforms [67.88675386638043]
The scattering transform is a multilayered wavelet-based deep learning architecture that acts as a model of convolutional neural networks.
We introduce windowed and non-windowed geometric scattering transforms for graphs based upon a very general class of asymmetric wavelets.
We show that these asymmetric graph scattering transforms have many of the same theoretical guarantees as their symmetric counterparts.
arXiv Detail & Related papers (2019-11-14T17:23:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.