Is Anisotropy Inherent to Transformers?
- URL: http://arxiv.org/abs/2306.07656v1
- Date: Tue, 13 Jun 2023 09:54:01 GMT
- Title: Is Anisotropy Inherent to Transformers?
- Authors: Nathan Godey, Éric de la Clergerie, Benoît Sagot
- Abstract summary: We show that anisotropy can be observed empirically in language models with specific objectives.
We also show that the anisotropy problem extends to Transformers trained on other modalities.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The representation degeneration problem is a phenomenon that is widely
observed among self-supervised learning methods based on Transformers. In NLP,
it takes the form of anisotropy, a singular property of hidden representations
which makes them unexpectedly close to each other in terms of angular distance
(cosine-similarity). Some recent works tend to show that anisotropy is a
consequence of optimizing the cross-entropy loss on long-tailed distributions
of tokens. We show in this paper that anisotropy can also be observed
empirically in language models with specific objectives that should not suffer
directly from the same consequences. We also show that the anisotropy problem
extends to Transformers trained on other modalities. Our observations tend to
demonstrate that anisotropy might actually be inherent to Transformer-based
models.
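The anisotropy the abstract describes is typically quantified as the average cosine similarity between randomly sampled hidden representations: near 0 for an isotropic embedding space, approaching 1 as representations degenerate toward a shared direction. A minimal sketch of that measurement (function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def average_cosine_similarity(h, n_pairs=10_000, seed=0):
    """Estimate anisotropy as the mean cosine similarity between
    random pairs of hidden representations (rows of h)."""
    rng = np.random.default_rng(seed)
    h = h / np.linalg.norm(h, axis=1, keepdims=True)  # unit-normalize rows
    i = rng.integers(0, len(h), n_pairs)
    j = rng.integers(0, len(h), n_pairs)
    mask = i != j  # exclude self-pairs
    return float(np.mean(np.sum(h[i[mask]] * h[j[mask]], axis=1)))

# Isotropic Gaussian vectors: pairwise similarity close to 0.
iso = np.random.default_rng(1).normal(size=(1000, 64))
# The same vectors shifted by a large common offset: similarity close to 1.
aniso = iso + 10.0
print(average_cosine_similarity(iso))
print(average_cosine_similarity(aniso))
```

The common-offset example mimics the "drift" often blamed for anisotropy: a shared mean direction dominates every representation, so all pairs look angularly close.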
Related papers
- Anisotropy Is Inherent to Self-Attention in Transformers [0.11510009152620666]
We show that anisotropy can be observed empirically in language models with specific objectives.
We also show that the anisotropy problem extends to Transformers trained on other modalities.
arXiv Detail & Related papers (2024-01-22T17:26:55Z) - Entanglement Entropy in Ground States of Long-Range Fermionic Systems [0.0]
We study the scaling of ground-state entanglement entropy of various free fermionic models on one-dimensional lattices.
We ask whether there exists a common $\alpha_c$ across different systems governing the transition to the area-law scaling found in local systems.
arXiv Detail & Related papers (2023-02-13T23:08:01Z) - Statistical Properties of the Entropy from Ordinal Patterns [55.551675080361335]
Knowing the joint distribution of the pair Entropy-Statistical Complexity for a large class of time series models would allow statistical tests that are unavailable to date.
We characterize the distribution of the empirical Shannon's Entropy for any model under which the true normalized Entropy is neither zero nor one.
We present a bilateral test that verifies if there is enough evidence to reject the hypothesis that two signals produce ordinal patterns with the same Shannon's Entropy.
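As background for the entry above, the "entropy from ordinal patterns" (permutation entropy) maps each length-`order` window of a series to the permutation that sorts it, then takes the normalized Shannon entropy of the pattern frequencies. A hedged sketch, not the paper's implementation:

```python
import math
from itertools import permutations

def permutation_entropy(x, order=3):
    """Normalized Shannon entropy of the ordinal patterns of x."""
    counts = {p: 0 for p in permutations(range(order))}
    for k in range(len(x) - order + 1):
        window = x[k:k + order]
        # The ordinal pattern of a window is its argsort.
        counts[tuple(sorted(range(order), key=lambda i: window[i]))] += 1
    total = len(x) - order + 1
    probs = [c / total for c in counts.values() if c > 0]
    h = sum(-p * math.log(p) for p in probs)
    return h / math.log(math.factorial(order))  # scale to [0, 1]

print(permutation_entropy(list(range(100))))  # monotone series -> 0.0
```

A fully predictable series produces a single pattern (entropy 0), while an i.i.d. series spreads mass over all `order!` patterns (entropy near 1); the cited paper studies the sampling distribution of this empirical quantity between those extremes.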
arXiv Detail & Related papers (2022-09-15T23:55:58Z) - Outliers Dimensions that Disrupt Transformers Are Driven by Frequency [79.22656609637525]
We show that the token frequency contributes to the outlier phenomenon.
We also find that, surprisingly, the outlier effect on the model performance varies by layer, and that variance is also related to the correlation between outlier magnitude and encoded token frequency.
arXiv Detail & Related papers (2022-05-23T15:19:09Z) - Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z) - A Probabilistic Interpretation of Transformers [91.3755431537592]
We propose a probabilistic interpretation of the exponential dot-product attention of transformers, and of contrastive learning, based on exponential families.
We state theoretical limitations of our theory and the Hopfield theory and suggest directions for resolution.
arXiv Detail & Related papers (2022-04-28T23:05:02Z) - Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z) - On Isotropy Calibration of Transformers [10.294618771570985]
Studies of the embedding space of transformer models suggest that the distribution of contextual representations is highly anisotropic.
A recent study shows that the embedding space of transformers is locally isotropic, which suggests that these models are already capable of exploiting the expressive capacity of their embedding space.
We conduct an empirical evaluation of state-of-the-art methods for isotropy calibration on transformers and find that they do not provide consistent improvements across models and tasks.
arXiv Detail & Related papers (2021-09-27T18:54:10Z) - Action Redundancy in Reinforcement Learning [54.291331971813364]
We show that transition entropy can be decomposed into two terms: model-dependent transition entropy and action redundancy.
Our results suggest that action redundancy is a fundamental problem in reinforcement learning.
arXiv Detail & Related papers (2021-02-22T19:47:26Z) - Dynamics of Ultracold Bosons in Artificial Gauge Fields: Angular Momentum, Fragmentation, and the Variance of Entropy [0.0]
We consider the dynamics of two-dimensional interacting ultracold bosons triggered by suddenly switching on an artificial gauge field.
We analyze the emergent dynamics by monitoring the angular momentum, the fragmentation, as well as the entropy and the variance of the entropy of absorption or single-shot images.
arXiv Detail & Related papers (2020-12-17T19:00:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.