Transformers as Unrolled Inference in Probabilistic Laplacian Eigenmaps: An Interpretation and Potential Improvements
- URL: http://arxiv.org/abs/2507.21040v1
- Date: Mon, 28 Jul 2025 17:56:34 GMT
- Title: Transformers as Unrolled Inference in Probabilistic Laplacian Eigenmaps: An Interpretation and Potential Improvements
- Authors: Aditya Ravuri, Neil D. Lawrence
- Abstract summary: We show that at initialisation, transformers perform "linear" dimensionality reduction. We also show that within the transformer block, a graph Laplacian term arises from our arguments, rather than an attention matrix.
- Score: 8.121681696358717
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We propose a probabilistic interpretation of transformers as unrolled inference steps assuming a probabilistic Laplacian Eigenmaps model from the ProbDR framework. Our derivation shows that at initialisation, transformers perform "linear" dimensionality reduction. We also show that within the transformer block, a graph Laplacian term arises from our arguments, rather than an attention matrix (which we interpret as an adjacency matrix). We demonstrate that simply subtracting the identity from the attention matrix (and thereby taking a graph diffusion step) improves validation performance on a language model and a simple vision transformer.
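The architectural change described in the abstract can be sketched directly. Below is a minimal illustration in PyTorch (the function name `diffusion_attention` and the tensor shapes are our assumptions, not the authors' code) of replacing the usual attention output $AV$ with $(A - I)V$, reading the softmax attention matrix $A$ as a graph adjacency matrix so that $A - I$ acts as a Laplacian-style diffusion operator on the token graph.

```python
import torch
import torch.nn.functional as F

def diffusion_attention(q, k, v):
    """Scaled dot-product attention with the identity subtracted.

    q, k, v: tensors of shape (batch, seq_len, d).
    The softmax matrix A is treated as a (row-stochastic) adjacency matrix,
    so (A - I) @ v is a graph diffusion step rather than the usual A @ v.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5          # (batch, n, n)
    A = F.softmax(scores, dim=-1)                      # attention / adjacency
    eye = torch.eye(A.shape[-1], device=A.device, dtype=A.dtype)
    return (A - eye) @ v                               # graph diffusion step

# Toy usage: self-attention over random tokens
x = torch.randn(2, 5, 16)
out = diffusion_attention(x, x, x)
print(out.shape)  # torch.Size([2, 5, 16])
```

In a full transformer block this amounts to a one-line modification of the attention sublayer; the abstract reports that it improves validation performance on a language model and a simple vision transformer.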
Related papers
- The calculus of variations of the Transformer on the hyperspherical tangent bundle [0.0]
We offer a theoretical mathematical background to Transformers through Lagrangian optimization across the token space. The Transformer, as a flow map, exists in the tangent fiber for each token along the high-dimensional unit sphere. We derive the Euler-Lagrange equation for the Transformer.
arXiv Detail & Related papers (2025-07-21T09:43:33Z) - Graph Transformers Dream of Electric Flow [72.06286909236827]
We show that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems. We present explicit weight configurations for implementing each algorithm, and we bound the constructed Transformers' errors by the errors of the underlying algorithms. Our work is an initial step towards elucidating the inner workings of the Transformer for graph data.
arXiv Detail & Related papers (2024-10-22T05:11:45Z) - Can Transformers Do Enumerative Geometry? [44.99833362998488]
We introduce a Transformer-based approach to computational enumerative geometry. We compute intersection numbers across a value range from $10^{-45}$ to $10^{45}$. We show that the network is implicitly modeling the Virasoro constraints in a purely data-driven manner.
arXiv Detail & Related papers (2024-08-27T09:44:01Z) - Clustering in pure-attention hardmax transformers and its role in sentiment analysis [0.0]
We rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity.
We show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders.
We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model.
arXiv Detail & Related papers (2024-06-26T16:13:35Z) - Transformers Can Represent $n$-gram Language Models [56.06361029539347]
We focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models.
We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM.
arXiv Detail & Related papers (2024-04-23T12:51:37Z) - Towards Understanding Inductive Bias in Transformers: A View From Infinity [9.00214539845063]
We argue transformers tend to be biased towards more permutation symmetric functions in sequence space.
We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions.
We argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
arXiv Detail & Related papers (2024-02-07T19:00:01Z) - Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z) - Mapping of attention mechanisms to a generalized Potts model [50.91742043564049]
We show that training a neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method.
We also compute the generalization error of self-attention in a model scenario analytically using the replica method.
arXiv Detail & Related papers (2023-04-14T16:32:56Z) - A Probabilistic Interpretation of Transformers [91.3755431537592]
We propose a probabilistic interpretation of exponential dot product attention of transformers and contrastive learning based on exponential families.
We state theoretical limitations of our theory and the Hopfield theory and suggest directions for resolution.
arXiv Detail & Related papers (2022-04-28T23:05:02Z) - XAI for Transformers: Better Explanations through Conservative
Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z) - The Convolution Exponential and Generalized Sylvester Flows [82.18442368078804]
This paper introduces a new method to build linear flows by taking the exponential of a linear transformation.
An important insight is that the exponential can be computed implicitly, which allows the use of convolutional layers; a short sketch of this idea appears after the list below.
We show that the convolution exponential outperforms other linear transformations in generative flows on CIFAR10.
arXiv Detail & Related papers (2020-06-02T19:43:36Z)
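Expanding on the implicit-exponential remark in the last entry above, here is a minimal sketch in PyTorch of applying $\exp(M)$ to an input via a truncated Taylor series, where the linear map $M$ is only ever applied as a convolution. The function name, kernel shapes, and truncation depth are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def conv_exponential(x, kernel, terms=8):
    """Apply exp(M) to x, where M is the linear map given by a convolution.

    x:      (batch, channels, height, width)
    kernel: (channels, channels, k, k) weights defining M, with k odd
    Uses the truncated series exp(M) x = sum_k M^k x / k!, so exp(M) is
    never materialised as a matrix; only repeated convolutions are applied.
    """
    pad = kernel.shape[-1] // 2
    out = x.clone()                                      # k = 0 term
    term = x
    for k in range(1, terms):
        term = F.conv2d(term, kernel, padding=pad) / k   # M^k x / k!
        out = out + term
    return out

# Toy usage: 3-channel inputs with a random 3x3 convolution kernel
x = torch.randn(4, 3, 8, 8)
w = 0.1 * torch.randn(3, 3, 3, 3)
y = conv_exponential(x, w)
print(y.shape)  # torch.Size([4, 3, 8, 8])
```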
This list is automatically generated from the titles and abstracts of the papers in this site.