A Probabilistic Interpretation of Transformers
- URL: http://arxiv.org/abs/2205.01080v1
- Date: Thu, 28 Apr 2022 23:05:02 GMT
- Title: A Probabilistic Interpretation of Transformers
- Authors: Alexander Shim
- Abstract summary: We propose a probabilistic interpretation of exponential dot product attention of transformers and contrastive learning based off of exponential families.
We state theoretical limitations of our theory and the Hopfield theory and suggest directions for resolution.
- Score: 91.3755431537592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a probabilistic interpretation of exponential dot product
attention of transformers and contrastive learning based off of exponential
families. The attention sublayer of transformers is equivalent to a gradient
ascent step of the log normalizer, which is the log-sum-exp term in the
Hopfield theory of attention. This ascent step induces a parallel expansion of
points, which is counterbalanced by a contraction from layer normalization. We
also state theoretical limitations of our theory and the Hopfield theory and
suggest directions for resolution.
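The abstract's central claim can be checked numerically in a simplified setting. The sketch below is an illustration, not the paper's code: it assumes a single query, keys that double as values, and no learned projections or temperature scaling, and the helper names (`log_normalizer`, `attention`, `numerical_grad`) are hypothetical. Under those assumptions, the gradient of the log-sum-exp term with respect to the query equals the softmax attention output.

```python
import numpy as np

def log_normalizer(q, K):
    """Log-sum-exp of the dot products between query q and the rows of K."""
    s = K @ q
    m = s.max()  # subtract the max for numerical stability
    return m + np.log(np.exp(s - m).sum())

def attention(q, K):
    """Softmax attention for one query, with keys reused as values (assumption)."""
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ K

def numerical_grad(f, q, eps=1e-6):
    """Central-difference estimate of the gradient of f at q."""
    g = np.zeros_like(q)
    for i in range(q.size):
        e = np.zeros_like(q)
        e[i] = eps
        g[i] = (f(q + e) - f(q - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
K = rng.normal(size=(5, 3))  # 5 keys of dimension 3
q = rng.normal(size=3)       # one query

grad = numerical_grad(lambda x: log_normalizer(x, K), q)
out = attention(q, K)
max_abs_diff = float(np.max(np.abs(grad - out)))
print(max_abs_diff)  # should be close to zero
```

In this tied-weights setting, adding the attention output to the query (as a residual connection does) has the form of one gradient ascent step with step size 1 on the log normalizer.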
Related papers
- Clustering in pure-attention hardmax transformers and its role in sentiment analysis [0.0]
We rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity.
We show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders.
We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model.
arXiv Detail & Related papers (2024-06-26T16:13:35Z) - Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer model does not always lead to enhanced performance.
Improved generalization ability occurs as the model memorizes the training samples.
We present a theoretical framework that sheds light on the memorization process and performance dynamics of transformer-based language models.
arXiv Detail & Related papers (2024-05-14T15:48:36Z) - Towards Understanding Inductive Bias in Transformers: A View From Infinity [9.00214539845063]
We argue transformers tend to be biased towards more permutation symmetric functions in sequence space.
We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions.
We argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
arXiv Detail & Related papers (2024-02-07T19:00:01Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z) - The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits.
This provides insight on the power of transformers using known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z) - Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z) - Entanglement Transitions from Stochastic Resetting of Non-Hermitian Quasiparticles [0.0]
We write down a renewal equation for the statistics of the entanglement entropy and show that, depending on the spectrum of quasiparticle decay rates, different entanglement scalings and even sharp entanglement phase transitions can arise.
When applied to a Quantum Ising chain where the transverse magnetization is measured by quantum jumps, our theory predicts a critical phase with logarithmic scaling of the entanglement, an area law phase and a continuous phase transition between them, with an effective central charge vanishing as a square root at the transition point.
arXiv Detail & Related papers (2021-11-05T13:38:04Z) - On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z) - The Convolution Exponential and Generalized Sylvester Flows [82.18442368078804]
This paper introduces a new method to build linear flows, by taking the exponential of a linear transformation.
An important insight is that the exponential can be computed implicitly, which allows the use of convolutional layers.
We show that the convolution exponential outperforms other linear transformations in generative flows on CIFAR10.
arXiv Detail & Related papers (2020-06-02T19:43:36Z)