Outlier Dimensions that Disrupt Transformers Are Driven by Frequency
- URL: http://arxiv.org/abs/2205.11380v1
- Date: Mon, 23 May 2022 15:19:09 GMT
- Title: Outlier Dimensions that Disrupt Transformers Are Driven by Frequency
- Authors: Giovanni Puccetti, Anna Rogers, Aleksandr Drozd and Felice
Dell'Orletta
- Abstract summary: We show that the token frequency contributes to the outlier phenomenon.
We also find that, surprisingly, the outlier effect on the model performance varies by layer, and that variance is also related to the correlation between outlier magnitude and encoded token frequency.
- Score: 79.22656609637525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based language models are known to display anisotropic behavior:
the token embeddings are not homogeneously spread in space, but rather
accumulate along certain directions. A related recent finding is the outlier
phenomenon: the parameters in the final element of Transformer layers that
consistently have unusual magnitude in the same dimension across the model, and
significantly degrade its performance if disabled. We replicate the evidence
for the outlier phenomenon and we link it to the geometry of the embedding
space. Our main finding is that in both BERT and RoBERTa the token frequency,
known to contribute to anisotropicity, also contributes to the outlier
phenomenon. In turn, the outlier phenomenon contributes to the "vertical"
self-attention pattern that enables the model to focus on the special tokens.
We also find that, surprisingly, the outlier effect on the model performance
varies by layer, and that variance is also related to the correlation between
outlier magnitude and encoded token frequency.
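To make the reported effect concrete, the following is a minimal sketch of how one might look for outlier dimensions in a pretrained BERT checkpoint and check whether their per-token magnitude tracks token frequency. The checkpoint name, the toy corpus, the use of the last hidden layer, and the 3-standard-deviation cutoff are illustrative assumptions, not the authors' exact criteria.

```python
# Sketch: locate candidate outlier dimensions in BERT hidden states and
# correlate their magnitude with a crude token-frequency estimate.
from collections import Counter

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption: any BERT/RoBERTa checkpoint
SENTENCES = [
    "The quick brown fox jumps over the lazy dog.",
    "Transformer embeddings accumulate along a few directions.",
    "Token frequency is a strong predictor of embedding geometry.",
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

token_ids, activations = [], []
with torch.no_grad():
    for text in SENTENCES:
        enc = tokenizer(text, return_tensors="pt")
        out = model(**enc)
        activations.append(out.hidden_states[-1].squeeze(0))  # (seq_len, hidden)
        token_ids.extend(enc["input_ids"].squeeze(0).tolist())

acts = torch.cat(activations, dim=0)  # (total_tokens, hidden_size)

# Flag candidate outlier dimensions: mean |activation| far above the
# across-dimension average (the 3-sigma cutoff is an illustrative choice).
dim_mag = acts.abs().mean(dim=0)      # (hidden_size,)
threshold = dim_mag.mean() + 3 * dim_mag.std()
outlier_dims = torch.nonzero(dim_mag > threshold).flatten().tolist()
print("candidate outlier dimensions:", outlier_dims)

# Correlate per-token magnitude in each flagged dimension with a toy
# token-frequency estimate taken from the same mini-corpus.
freq = Counter(token_ids)
freqs = torch.tensor([freq[t] for t in token_ids], dtype=torch.float)
for dim in outlier_dims:
    mags = acts[:, dim].abs()
    r = torch.corrcoef(torch.stack([mags, freqs]))[0, 1].item()
    print(f"dim {dim}: corr(|activation|, frequency) = {r:.3f}")
```

With only a few sentences the frequency estimate is crude; the paper works with corpus-level frequencies and per-layer analyses, so this sketch only illustrates the kind of correlation being measured.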
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Unveiling and Controlling Anomalous Attention Distribution in Transformers [8.456319173083315]
The waiver phenomenon allows certain elements to absorb excess attention without affecting their contribution to information.
In specific models, due to differences in positional encoding and attention patterns, we find that the model's selection of waiver elements falls into two categories.
arXiv Detail & Related papers (2024-06-26T11:53:35Z)
- Transformer Normalisation Layers and the Independence of Semantic Subspaces [17.957364289876548]
We consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution.
We show that Pre-Norm, the normalisation-layer placement used by state-of-the-art transformers, violates this ability.
We observe a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim$10%.
arXiv Detail & Related papers (2024-06-25T16:16:38Z)
- Anisotropy Is Inherent to Self-Attention in Transformers [0.11510009152620666]
We show that anisotropy can be observed empirically in language models with specific objectives.
We also show that the anisotropy problem extends to Transformers trained on other modalities.
arXiv Detail & Related papers (2024-01-22T17:26:55Z)
- Is Anisotropy Inherent to Transformers? [0.0]
We show that anisotropy can be observed empirically in language models with specific objectives.
We also show that the anisotropy problem extends to Transformers trained on other modalities.
arXiv Detail & Related papers (2023-06-13T09:54:01Z)
- Random unitaries, Robustness, and Complexity of Entanglement [0.0]
It is widely accepted that the dynamics of entanglement in the presence of a generic circuit can be predicted from the statistical properties of the entanglement spectrum.
We test this assumption by applying a Metropolis-like entanglement cooling algorithm generated by different sets of local gates.
We observe that the entanglement dynamics are strongly dependent not just on the different sets of gates but also on the phase.
arXiv Detail & Related papers (2022-10-24T18:00:06Z)
- Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z)
- Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation [64.92152574895111]
We propose a simple Orthogonal Jacobian Regularization (OroJaR) to encourage deep generative models to learn disentangled representations.
Our method is effective in disentangled and controllable image generation, and performs favorably against the state-of-the-art methods.
arXiv Detail & Related papers (2021-08-17T15:01:46Z)
- Quantum asymmetry and noisy multi-mode interferometry [55.41644538483948]
Quantum asymmetry is a physical resource which coincides with the amount of coherence between the eigenspaces of a generator.
We show that the asymmetry may increase as a result of a decrease of coherence inside a degenerate subspace.
arXiv Detail & Related papers (2021-07-23T07:30:57Z)
- Hard-label Manifolds: Unexpected Advantages of Query Efficiency for Finding On-manifold Adversarial Examples [67.23103682776049]
Recent zeroth order hard-label attacks on image classification models have shown comparable performance to their first-order, gradient-level alternatives.
It was recently shown in the gradient-level setting that regular adversarial examples leave the data manifold, while their on-manifold counterparts are in fact generalization errors.
We propose an information-theoretic argument based on a noisy manifold distance oracle, which leaks manifold information through the adversary's gradient estimate.
arXiv Detail & Related papers (2021-03-04T20:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.