Transformer Normalisation Layers and the Independence of Semantic Subspaces
- URL: http://arxiv.org/abs/2406.17837v1
- Date: Tue, 25 Jun 2024 16:16:38 GMT
- Title: Transformer Normalisation Layers and the Independence of Semantic Subspaces
- Authors: Stephen Menary, Samuel Kaski, Andre Freitas
- Abstract summary: We consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution.
We show that Pre-Norm, the normalisation-layer placement used by state-of-the-art transformers, violates this ability.
We observe a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim$10%.
- Score: 17.957364289876548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have shown that transformers can solve contextual reasoning tasks by internally executing computational graphs called circuits. Circuits often use attention to logically match information from subspaces of the representation, e.g. using position-in-sequence to identify the previous token. In this work, we consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution. We show that Pre-Norm, the normalisation-layer placement used by state-of-the-art transformers, violates this ability unless the model learns a strict representation structure of orthogonal spheres. This is because it causes linear subspaces to interfere through their common normalisation factor. Theoretically, we analyse circuit stability by modelling this interference as random noise on the $L_2$-norms of the query/key/value vectors, predicting a phenomenon of circuit collapse when sparse attention shifts to a different token. Empirically, we investigate the sensitivity of real-world models trained for mathematical addition, observing a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim$10%. We contrast Pre-Norm with QKV-Norm, which places normalisation after the attention head's linear operators. Theoretically, this relaxes the representational constraints. Empirically, we observe comparable in-distribution but worse out-of-distribution performance.
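The interference mechanism described in the abstract can be made concrete with a small numerical sketch. This is not the authors' code: the single attention head, the RMS-style normalisation, the random weights, and the perturbation size are all illustrative assumptions. The idea is to perturb the residual stream along a direction that the query/key/value projections ignore, then check whether the attention pattern moves under each normalisation placement.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalise each row to (approximately) unit root-mean-square.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn_weights(q, k):
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

rng = np.random.default_rng(0)
seq, d_model, d_head = 8, 32, 8
x = rng.normal(size=(seq, d_model))                      # residual-stream vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
                 for _ in range(3))

# Perturb x along a direction in the shared null space of W_q, W_k, W_v,
# i.e. a part of the representation this head never reads.
W_all = np.concatenate([W_q, W_k, W_v], axis=1)          # (d_model, 3 * d_head)
null_dir = np.linalg.svd(W_all.T)[2][-1]                 # W_all.T @ null_dir ~ 0
x_pert = x + 2.0 * null_dir

# Pre-Norm: the whole residual-stream vector is normalised *before* the
# Q/K/V projections, so every subspace shares one normalisation factor.
A_pre   = attn_weights(rms_norm(x) @ W_q,      rms_norm(x) @ W_k)
A_pre_p = attn_weights(rms_norm(x_pert) @ W_q, rms_norm(x_pert) @ W_k)

# QKV-Norm: normalisation is applied *after* the head's linear operators,
# so q and k no longer depend on the norm of the rest of the representation.
A_qkv   = attn_weights(rms_norm(x @ W_q),      rms_norm(x @ W_k))
A_qkv_p = attn_weights(rms_norm(x_pert @ W_q), rms_norm(x_pert @ W_k))

print("Pre-Norm attention shift :", np.abs(A_pre - A_pre_p).max())   # clearly > 0
print("QKV-Norm attention shift :", np.abs(A_qkv - A_qkv_p).max())   # ~ machine precision
```

In this toy setting the QKV-Norm attention weights are unchanged up to numerical precision, while the Pre-Norm weights shift because the perturbation leaks in through the shared normalisation factor; this is the interference that the paper formalises as noise on the query/key/value norms.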
Related papers
- A metaplectic perspective of uncertainty principles in the Linear Canonical Transform domain [0.0]
We derive Heisenberg uncertainty principles for pairs of Linear Canonical Transforms of a given function.
We also propose a new quadratic phase-space distribution, which represents a signal along two intermediate directions in the time-frequency plane.
arXiv Detail & Related papers (2024-05-17T09:26:48Z) - Uniformly Decaying Subspaces for Error Mitigated Quantum Computation [2.7363128425496868]
We present a general condition to obtain subspaces that decay uniformly in a system governed by the Lindblad master equation.
The expectation values of dynamics encoded in such subspaces are unbiased estimators of noise-free expectation values.
We show that such subspaces can be used to eliminate bias up to first order variations in the decay rates without requiring full knowledge of noise.
arXiv Detail & Related papers (2024-02-29T22:25:19Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks [4.213427823201119]
Our study reveals new theoretical insights into over-smoothing and feature over-correlation in deep graph neural networks.
We show the prevalence of invariant subspaces, demonstrating a fixed relative behavior unaffected by feature transformations.
We empirically extend our insights to the non-linear case, demonstrating the inability of existing models to capture linearly independent features.
arXiv Detail & Related papers (2023-08-31T15:22:31Z) - Outliers Dimensions that Disrupt Transformers Are Driven by Frequency [79.22656609637525]
We show that the token frequency contributes to the outlier phenomenon.
We also find that, surprisingly, the outlier effect on the model performance varies by layer, and that variance is also related to the correlation between outlier magnitude and encoded token frequency.
arXiv Detail & Related papers (2022-05-23T15:19:09Z) - Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z) - Zero Pixel Directional Boundary by Vector Transform [77.63061686394038]
We re-interpret boundaries as 1-D surfaces and formulate a one-to-one vector transform function that allows for training of boundary prediction completely avoiding the class imbalance issue.
Our problem formulation leads to the estimation of direction as well as richer contextual information of the boundary, and, if desired, the availability of zero-pixel thin boundaries also at training time.
arXiv Detail & Related papers (2022-03-16T17:55:31Z) - Unsupervised Disentanglement with Tensor Product Representations on the Torus [78.6315881294899]
Current methods for learning representations with auto-encoders almost exclusively employ vectors as the latent representations.
In this work, we propose to employ a tensor product structure for this purpose.
In contrast to conventional variational methods, which are targeted toward normally distributed features, the latent space in our representation is distributed uniformly over a set of unit circles.
arXiv Detail & Related papers (2022-02-13T04:23:12Z) - Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks [0.0]
We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow.
We show that the network learns eigenfunctions of an integral operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK).
We conclude that damped deviations offers a simple and unifying perspective of the dynamics when optimizing the squared error.
arXiv Detail & Related papers (2022-01-12T23:28:41Z) - Causal Expectation-Maximisation [70.45873402967297]
We show that causal inference is NP-hard even in models characterised by polytree-shaped graphs.
We introduce the causal EM algorithm to reconstruct the uncertainty about the latent variables from data about categorical manifest variables.
We argue that there appears to be an unnoticed limitation to the trending idea that counterfactual bounds can often be computed without knowledge of the structural equations.
arXiv Detail & Related papers (2020-11-04T10:25:13Z) - Principled Interpolation in Normalizing Flows [5.582101184758527]
Generative models based on normalizing flows are very successful in modeling complex data distributions.
Straightforward linear interpolations show unexpected side effects, as the interpolation paths lie outside the area where samples are observed.
This observation suggests that correcting the norm should generally result in better interpolations, but it is not clear how to correct the norm in an unambiguous way (a simple illustrative correction is sketched below).
arXiv Detail & Related papers (2020-10-22T21:02:10Z)
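For the normalizing-flow interpolation entry above, here is a minimal sketch of what "correcting the norm" can mean, assuming the common setup of a standard-Gaussian base distribution. Rescaling the linear interpolant so that its norm interpolates between the endpoint norms is one simple illustrative choice, not necessarily that paper's proposal.

```python
import numpy as np

def norm_corrected_interp(z0, z1, t):
    # Linear interpolation whose L2 norm is itself linearly interpolated
    # (one simple, illustrative norm correction).
    z = (1.0 - t) * z0 + t * z1
    target = (1.0 - t) * np.linalg.norm(z0) + t * np.linalg.norm(z1)
    return z * target / np.linalg.norm(z)

rng = np.random.default_rng(0)
d = 512                                    # latent dimension of the flow (assumed)
z0, z1 = rng.normal(size=d), rng.normal(size=d)

mid_plain = 0.5 * (z0 + z1)
mid_fixed = norm_corrected_interp(z0, z1, 0.5)

# Gaussian latents concentrate near the shell of radius sqrt(d); the plain
# midpoint drops to ~0.71 of that radius, the corrected one stays on it.
print(np.linalg.norm(mid_plain) / np.sqrt(d))
print(np.linalg.norm(mid_fixed) / np.sqrt(d))
```

The printout makes the entry's point concrete: the naive path leaves the region where Gaussian latents actually lie, and a norm correction keeps it there, though how to choose that correction unambiguously is the question the paper raises.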
This list is automatically generated from the titles and abstracts of the papers on this site.