A Probabilistic Interpretation of Transformers
- URL: http://arxiv.org/abs/2205.01080v1
- Date: Thu, 28 Apr 2022 23:05:02 GMT
- Title: A Probabilistic Interpretation of Transformers
- Authors: Alexander Shim
- Abstract summary: We propose a probabilistic interpretation of exponential dot product attention of transformers and contrastive learning based on exponential families.
We state theoretical limitations of our theory and the Hopfield theory and suggest directions for resolution.
- Score: 91.3755431537592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a probabilistic interpretation of exponential dot product
attention of transformers and contrastive learning based on exponential
families. The attention sublayer of transformers is equivalent to a gradient
ascent step of the log normalizer, which is the log-sum-exp term in the
Hopfield theory of attention. This ascent step induces a parallel expansion of
points, which is counterbalanced by a contraction from layer normalization. We
also state theoretical limitations of our theory and the Hopfield theory and
suggest directions for resolution.
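The abstract's central identity can be checked numerically: with values tied to keys (as in the Hopfield view of attention), the output of exponential dot-product attention equals the gradient of the log normalizer (the log-sum-exp term) with respect to the query, so the attention sublayer acts as one gradient ascent step on it. A minimal NumPy sketch (the setup and variable names are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(5, 3))   # keys, which also serve as values in the Hopfield view
q = rng.normal(size=3)        # a single query

def log_normalizer(q, K):
    # log-sum-exp of the dot-product scores: the log normalizer of the
    # exponential family / the energy term in the Hopfield theory of attention
    s = K @ q
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

def attention(q, K):
    # exponential dot-product attention with V = K
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ K

# Finite-difference gradient of the log normalizer with respect to the query
eps = 1e-6
grad = np.array([
    (log_normalizer(q + eps * e, K) - log_normalizer(q - eps * e, K)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad, attention(q, K), atol=1e-5))  # attention output = gradient
```

A gradient ascent step `q + eta * attention(q, K)` then moves the query uphill on the log normalizer, which is the expansion the abstract says layer normalization counterbalances.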
Related papers
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence on the regression loss despite the non-convexity of the optimization landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
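The multi-step gradient descent that looped Transformers are argued to implement can be made concrete on an in-context linear regression task. A hedged NumPy sketch of the underlying iteration (this runs the gradient steps directly; it is not the paper's transformer construction, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# An in-context linear regression task: the "prompt" consists of (X, y) pairs
X = rng.normal(size=(20, 4))
w_true = rng.normal(size=4)
y = X @ w_true

# A looped transformer is argued to emulate repeated gradient steps on the
# in-context least-squares loss; here each loop iteration is one such step.
w = np.zeros(4)
eta = 0.01
losses = []
for _ in range(200):
    grad = X.T @ (X @ w - y)   # gradient of 0.5 * ||X w - y||^2
    w -= eta * grad
    losses.append(0.5 * np.sum((X @ w - y) ** 2))

print(losses[0] > losses[-1])  # the loss decreases across loop iterations
```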
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Clustering in pure-attention hardmax transformers and its role in sentiment analysis [0.0]
We rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity.
We show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders.
We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model.
arXiv Detail & Related papers (2024-06-26T16:13:35Z)
- Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer model does not always lead to enhanced performance.
Improved generalization ability occurs as the model memorizes the training samples.
We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models.
arXiv Detail & Related papers (2024-05-14T15:48:36Z)
- Towards Understanding Inductive Bias in Transformers: A View From Infinity [9.00214539845063]
We argue transformers tend to be biased towards more permutation symmetric functions in sequence space.
We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions.
We argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
arXiv Detail & Related papers (2024-02-07T19:00:01Z)
- On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
- The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits.
This provides insight into the power of transformers using known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z)
- On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
- The Convolution Exponential and Generalized Sylvester Flows [82.18442368078804]
This paper introduces a new method to build linear flows, by taking the exponential of a linear transformation.
An important insight is that the exponential can be computed implicitly, which allows the use of convolutional layers.
We show that the convolution exponential outperforms other linear transformations in generative flows on CIFAR10.
arXiv Detail & Related papers (2020-06-02T19:43:36Z)
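The implicit computation of the exponential mentioned in the last entry can be illustrated with a plain matrix standing in for a convolution: exp(A) applied to x is accumulated from matrix-vector products via the truncated power series, so exp(A) itself is never formed. A small NumPy sketch under that substitution (the dense map A is a stand-in for a convolutional layer):

```python
import numpy as np

rng = np.random.default_rng(2)
A = 0.1 * rng.normal(size=(6, 6))   # a generic linear map, stand-in for a convolution
x = rng.normal(size=6)

def exp_apply(A, x, terms=20):
    # Apply exp(A) to x implicitly via the truncated power series
    #   exp(A) x = sum_k A^k x / k!
    # Only matrix-vector (here: convolution-like) products are needed,
    # so the matrix exp(A) never has to be materialized.
    out = x.copy()
    term = x.copy()
    for k in range(1, terms):
        term = A @ term / k
        out += term
    return out

# Compare against an explicitly accumulated dense matrix exponential
E = np.eye(6)
term = np.eye(6)
for k in range(1, 20):
    term = A @ term / k
    E += term

print(np.allclose(exp_apply(A, x), E @ x))  # implicit and explicit results agree
```

Because exp(A) is always invertible (its inverse is exp(-A)), the resulting map is a valid linear flow by construction.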
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.