Transformer Normalisation Layers and the Independence of Semantic Subspaces
- URL: http://arxiv.org/abs/2406.17837v1
- Date: Tue, 25 Jun 2024 16:16:38 GMT
- Title: Transformer Normalisation Layers and the Independence of Semantic Subspaces
- Authors: Stephen Menary, Samuel Kaski, Andre Freitas
- Abstract summary: We consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution.
We show that Pre-Norm, the placement of the normalisation layer used by state-of-the-art transformers, violates this ability.
We observe a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim$10%.
- Score: 17.957364289876548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have shown that transformers can solve contextual reasoning tasks by internally executing computational graphs called circuits. Circuits often use attention to logically match information from subspaces of the representation, e.g. using position-in-sequence to identify the previous token. In this work, we consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution. We show that Pre-Norm, the placement of the normalisation layer used by state-of-the-art transformers, violates this ability unless the model learns a strict representation structure of orthogonal spheres. This is because it causes linear subspaces to interfere through their common normalisation factor. Theoretically, we analyse circuit stability by modelling this interference as random noise on the $L_2$-norms of the query/key/value vectors, predicting a phenomenon of circuit collapse when sparse-attention shifts to a different token. Empirically, we investigate the sensitivity of real-world models trained for mathematical addition, observing a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim$10%. We contrast Pre-Norm with QKV-Norm, which places normalisation after the attention head's linear operators. Theoretically this relaxes the representational constraints. Empirically we observe comparable in-distribution but worse out-of-distribution performance.
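To make the two placements concrete, below is a minimal single-head sketch (not the authors' code) contrasting Pre-Norm, where a single LayerNorm on the residual stream rescales every linear subspace by one shared factor, with QKV-Norm, which normalises after the head's query/key/value linear operators. The module names, the choice of LayerNorm as the post-operator normalisation, and the hyperparameters are illustrative assumptions.

```python
# Minimal sketch, assuming a single-head attention block with LayerNorm.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadAttention(nn.Module):
    def __init__(self, d_model: int, placement: str = "pre_norm"):
        super().__init__()
        assert placement in {"pre_norm", "qkv_norm"}
        self.placement = placement
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)     # Pre-Norm: applied before the head
        self.q_norm = nn.LayerNorm(d_model)   # QKV-Norm: applied after W_q
        self.k_norm = nn.LayerNorm(d_model)   # ... after W_k
        self.v_norm = nn.LayerNorm(d_model)   # ... after W_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.placement == "pre_norm":
            # One shared normalisation factor couples all linear subspaces of x.
            h = self.norm(x)
            q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        else:
            # Normalise q, k, v after the linear operators, so the subspace read
            # by each operator is not rescaled by unrelated subspaces.
            q = self.q_norm(self.w_q(x))
            k = self.k_norm(self.w_k(x))
            v = self.v_norm(self.w_v(x))
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return x + attn @ v                   # residual connection


if __name__ == "__main__":
    x = torch.randn(2, 8, 64)                 # (batch, tokens, d_model)
    for placement in ("pre_norm", "qkv_norm"):
        y = SingleHeadAttention(64, placement)(x)
        print(placement, tuple(y.shape))
```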
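The circuit-collapse mechanism can likewise be illustrated numerically. The sketch below (an illustration of the mechanism, not the paper's experiment) perturbs the $L_2$-norms of key vectors by random factors within roughly $\pm$10% and counts how often the argmax of the attention logits, i.e. the token a sparse-attention circuit would select, changes. With random Gaussian vectors the logit margins, and hence the collapse rate, will differ from the ~1% reported for models trained on mathematical addition; only the mechanism is the same. All sizes and the uniform perturbation model are assumptions.

```python
# Minimal sketch of circuit collapse under norm noise, assuming random
# Gaussian query/key vectors and uniform +/-10% scaling of key norms.
import torch

torch.manual_seed(0)
d, n_tokens, n_trials, scale = 64, 16, 10_000, 0.10

q = torch.randn(d)
k = torch.randn(n_tokens, d)
baseline = torch.argmax(k @ q)   # token the unperturbed circuit attends to

collapses = 0
for _ in range(n_trials):
    # Rescale each key's norm by a factor drawn uniformly from [0.9, 1.1].
    factors = 1.0 + scale * (2 * torch.rand(n_tokens) - 1)
    if torch.argmax((k * factors[:, None]) @ q) != baseline:
        collapses += 1

print(f"collapse rate: {collapses / n_trials:.1%}")
```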
Related papers
- On the phase diagram of extensive-rank symmetric matrix denoising beyond rotational invariance [5.058205542605482]
We make progress towards the understanding of matrix denoising when the hidden signal is a factored matrix $XX^\intercal$ that is not rotationally invariant.
We argue that it is only beyond the transition that factorisation, i.e., estimating $X$ itself, becomes possible up to sign and permutation universality.
arXiv Detail & Related papers (2024-11-04T10:50:37Z) - Refined Risk Bounds for Unbounded Losses via Transductive Priors [58.967816314671296]
We revisit the sequential variants of linear regression with the squared loss, classification problems with hinge loss, and logistic regression.
Our key tools are based on the exponential weights algorithm with carefully chosen transductive priors.
arXiv Detail & Related papers (2024-10-29T00:01:04Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Operator space fragmentation in perturbed Floquet-Clifford circuits [0.0]
Floquet quantum circuits are able to realise a wide range of non-equilibrium quantum states.
We investigate the stability of operator localisation and emergence of chaos in random Floquet-Clifford circuits.
arXiv Detail & Related papers (2024-08-02T19:18:30Z) - A metaplectic perspective of uncertainty principles in the Linear Canonical Transform domain [0.0]
We derive Heisenberg uncertainty principles for pairs of Linear Canonical Transforms of a given function.
We also propose a new quadratic phase-space distribution, which represents a signal along two intermediate directions in the time-frequency plane.
arXiv Detail & Related papers (2024-05-17T09:26:48Z) - Uniformly Decaying Subspaces for Error Mitigated Quantum Computation [2.7363128425496868]
We present a general condition to obtain subspaces that decay uniformly in a system governed by the Lindblad master equation.
The expectation values of dynamics encoded in such subspaces are unbiased estimators of noise-free expectation values.
We show that such subspaces can be used to eliminate bias up to first order variations in the decay rates without requiring full knowledge of noise.
arXiv Detail & Related papers (2024-02-29T22:25:19Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Outliers Dimensions that Disrupt Transformers Are Driven by Frequency [79.22656609637525]
We show that the token frequency contributes to the outlier phenomenon.
We also find that, surprisingly, the outlier effect on the model performance varies by layer, and that variance is also related to the correlation between outlier magnitude and encoded token frequency.
arXiv Detail & Related papers (2022-05-23T15:19:09Z) - Unsupervised Disentanglement with Tensor Product Representations on the Torus [78.6315881294899]
Current methods for learning representations with auto-encoders almost exclusively employ vectors as the latent representations.
In this work, we propose to employ a tensor product structure for this purpose.
In contrast to conventional variational methods, which are targeted toward normally distributed features, the latent space in our representation is distributed uniformly over a set of unit circles.
arXiv Detail & Related papers (2022-02-13T04:23:12Z) - Causal Expectation-Maximisation [70.45873402967297]
We show that causal inference is NP-hard even in models characterised by polytree-shaped graphs.
We introduce the causal EM algorithm to reconstruct the uncertainty about the latent variables from data about categorical manifest variables.
We argue that there appears to be an unnoticed limitation to the trending idea that counterfactual bounds can often be computed without knowledge of the structural equations.
arXiv Detail & Related papers (2020-11-04T10:25:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.