Anisotropy Is Inherent to Self-Attention in Transformers
- URL: http://arxiv.org/abs/2401.12143v2
- Date: Wed, 24 Jan 2024 16:07:00 GMT
- Title: Anisotropy Is Inherent to Self-Attention in Transformers
- Authors: Nathan Godey, Éric de la Clergerie, and Benoît Sagot
- Abstract summary: We show that anisotropy can be observed empirically in language models with specific objectives.
We also show that the anisotropy problem extends to Transformers trained on other modalities.
- Score: 0.11510009152620666
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The representation degeneration problem is a phenomenon that is widely
observed among self-supervised learning methods based on Transformers. In NLP,
it takes the form of anisotropy, a singular property of hidden representations
which makes them unexpectedly close to each other in terms of angular distance
(cosine-similarity). Some recent works tend to show that anisotropy is a
consequence of optimizing the cross-entropy loss on long-tailed distributions
of tokens. We show in this paper that anisotropy can also be observed
empirically in language models with specific objectives that should not suffer
directly from the same consequences. We also show that the anisotropy problem
extends to Transformers trained on other modalities. Our observations suggest
that anisotropy is actually inherent to Transformers-based models.
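The anisotropy described in the abstract is commonly quantified as the average pairwise cosine similarity between hidden representations. The sketch below is a minimal illustration of that measurement, not the authors' code; the function name and the toy data are assumptions. It contrasts an isotropic Gaussian baseline with vectors that share a large common component, mimicking the drift direction often reported for Transformer hidden states:

```python
import numpy as np

def mean_pairwise_cosine(x: np.ndarray) -> float:
    """Average pairwise cosine similarity of the rows of x, shape (n, d).

    Values near 0 indicate isotropy; values near 1 indicate that the
    representations collapse into a narrow cone (anisotropy).
    """
    normed = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = x.shape[0]
    # Exclude the diagonal (self-similarity is always 1).
    return float((sims.sum() - n) / (n * (n - 1)))

rng = np.random.default_rng(0)

# Isotropic baseline: independent Gaussian vectors are nearly orthogonal
# in high dimension, so the mean cosine similarity is close to 0.
isotropic = rng.normal(size=(512, 64))

# Anisotropic toy case: every vector shares a large common component,
# so the mean cosine similarity is close to 1.
anisotropic = rng.normal(size=(512, 64)) + 5.0 * rng.normal(size=(1, 64))

print(mean_pairwise_cosine(isotropic))    # near 0
print(mean_pairwise_cosine(anisotropic))  # near 1
```

The same statistic applied to real hidden states (e.g. from a pretrained Transformer) is what the degeneration literature typically reports.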
Related papers
- Is Anisotropy Inherent to Transformers? [0.0]
We show that anisotropy can be observed empirically in language models with specific objectives.
We also show that the anisotropy problem extends to Transformers trained on other modalities.
arXiv Detail & Related papers (2023-06-13T09:54:01Z)
- Entanglement Entropy in Ground States of Long-Range Fermionic Systems [0.0]
We study the scaling of ground state entanglement entropy of various free fermionic models on one dimensional lattices.
We ask if there exists a common $\alpha_c$ across different systems governing the transition to area-law scaling found in local systems.
arXiv Detail & Related papers (2023-02-13T23:08:01Z)
- Statistical Properties of the Entropy from Ordinal Patterns [55.551675080361335]
Knowing the joint distribution of the pair Entropy-Statistical Complexity for a large class of time series models would allow statistical tests that are unavailable to date.
We characterize the distribution of the empirical Shannon's Entropy for any model under which the true normalized Entropy is neither zero nor one.
We present a bilateral test that verifies if there is enough evidence to reject the hypothesis that two signals produce ordinal patterns with the same Shannon's Entropy.
arXiv Detail & Related papers (2022-09-15T23:55:58Z)
- Outliers Dimensions that Disrupt Transformers Are Driven by Frequency [79.22656609637525]
We show that the token frequency contributes to the outlier phenomenon.
We also find that, surprisingly, the outlier effect on model performance varies by layer, and that this variance is related to the correlation between outlier magnitude and encoded token frequency.
arXiv Detail & Related papers (2022-05-23T15:19:09Z)
- Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z)
- Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z)
- On Isotropy Calibration of Transformers [10.294618771570985]
Studies of the embedding space of transformer models suggest that the distribution of contextual representations is highly anisotropic.
A recent study shows that the embedding space of transformers is locally isotropic, which suggests that these models are already capable of exploiting the expressive capacity of their embedding space.
We conduct an empirical evaluation of state-of-the-art methods for isotropy calibration on transformers and find that they do not provide consistent improvements across models and tasks.
arXiv Detail & Related papers (2021-09-27T18:54:10Z)
- Action Redundancy in Reinforcement Learning [54.291331971813364]
We show that transition entropy can be described by two terms; namely, model-dependent transition entropy and action redundancy.
Our results suggest that action redundancy is a fundamental problem in reinforcement learning.
arXiv Detail & Related papers (2021-02-22T19:47:26Z)
- Eigenstate entanglement entropy in $PT$ invariant non-Hermitian system [0.0]
We study a non-Hermitian, non-interacting model of fermions which is invariant under combined $PT$ transformation.
Our models show a phase transition from $PT$ unbroken phase to broken phase as we tune the hermiticity breaking parameter.
arXiv Detail & Related papers (2021-02-01T19:00:08Z)
- Generalized Entropy Regularization or: There's Nothing Special about Label Smoothing [83.78668073898001]
We introduce a family of entropy regularizers, which includes label smoothing as a special case.
We find that variance in model performance can be explained largely by the resulting entropy of the model.
We advise the use of other entropy regularization methods in its place.
arXiv Detail & Related papers (2020-05-02T12:46:28Z)
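The entry above treats label smoothing as one member of a broader family of entropy regularizers. The sketch below shows only that special case, under assumed names (`smoothed_cross_entropy`, `eps`) rather than the paper's actual formulation: the one-hot target is mixed with the uniform distribution, which penalizes over-confident, low-entropy predictions by bounding the loss away from zero.

```python
import numpy as np

def smoothed_cross_entropy(logits: np.ndarray, target: int, eps: float = 0.1) -> float:
    """Cross-entropy against a label-smoothed target distribution.

    The one-hot target is mixed with the uniform distribution:
    q = (1 - eps) * one_hot + eps / num_classes.
    With eps = 0 this reduces to standard cross-entropy.
    """
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    k = logits.shape[0]
    q = np.full(k, eps / k)
    q[target] += 1.0 - eps
    return float(-(q * log_probs).sum())

confident = np.array([10.0, 0.0, 0.0])
uncertain = np.array([1.0, 0.5, 0.0])

# Without smoothing, an extremely peaked prediction drives the loss to ~0;
# with smoothing, the loss on the same prediction stays bounded away from 0.
print(smoothed_cross_entropy(confident, target=0, eps=0.0))
print(smoothed_cross_entropy(confident, target=0, eps=0.1))
```

The bounded loss under smoothing is the mechanism by which this regularizer discourages the model from collapsing its output entropy to zero.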
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.