Latent Positional Information is in the Self-Attention Variance of
Transformer Language Models Without Positional Embeddings
- URL: http://arxiv.org/abs/2305.13571v1
- Date: Tue, 23 May 2023 01:03:40 GMT
- Title: Latent Positional Information is in the Self-Attention Variance of
Transformer Language Models Without Positional Embeddings
- Authors: Ta-Chung Chi and Ting-Han Fan and Li-Wei Chen and Alexander I.
Rudnicky and Peter J. Ramadge
- Abstract summary: We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
- Score: 68.61185138897312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of positional embeddings in transformer language models is widely
accepted. However, recent research has called into question the necessity of
such embeddings. We further extend this inquiry by demonstrating that a
randomly initialized and frozen transformer language model, devoid of
positional embeddings, inherently encodes strong positional information through
the shrinkage of self-attention variance. To quantify this variance, we derive
the underlying distribution of each step within a transformer layer. Through
empirical validation using a fully pretrained model, we show that the variance
shrinkage effect still persists after extensive gradient updates. Our findings
serve to justify the decision to discard positional embeddings and thus
facilitate more efficient pretraining of transformer language models.
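As a quick illustration of the claimed mechanism: under a causal mask, the attention output at position t is a weighted average of the t+1 value vectors it can attend to, so with roughly i.i.d. inputs its variance shrinks on the order of 1/(t+1). The sketch below is not the authors' code; the layer sizes, Gaussian toy inputs, and single frozen attention layer are assumptions made for illustration. It measures the per-position output variance of a randomly initialized, frozen causal self-attention layer with no positional embeddings.

import torch

torch.manual_seed(0)
d_model, n_tokens, n_trials = 64, 128, 200  # toy sizes, chosen for illustration

# Randomly initialized, frozen query/key/value projections (no positional embeddings).
W_q = torch.randn(d_model, d_model) / d_model ** 0.5
W_k = torch.randn(d_model, d_model) / d_model ** 0.5
W_v = torch.randn(d_model, d_model) / d_model ** 0.5

# Causal mask: position t may only attend to positions 0..t.
causal_mask = torch.tril(torch.ones(n_tokens, n_tokens)).bool()

outputs = []
with torch.no_grad():
    for _ in range(n_trials):
        x = torch.randn(n_tokens, d_model)  # i.i.d. inputs carrying no positional signal
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        scores = (q @ k.T) / d_model ** 0.5
        scores = scores.masked_fill(~causal_mask, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        outputs.append(attn @ v)  # (n_tokens, d_model)

out = torch.stack(outputs)  # (n_trials, n_tokens, d_model)
# Variance of each position's output, pooled over trials and channels.
var_per_pos = out.transpose(0, 1).reshape(n_tokens, -1).var(dim=1)
for t in (0, 3, 15, 63, 127):
    print(f"position {t:3d}: output variance ~ {var_per_pos[t].item():.4f}")
# Later positions average over more value vectors, so their output variance is
# smaller -- a cue to absolute position despite the absence of positional embeddings.

Running this prints variances that decay roughly like 1/(t+1) across positions; the paper quantifies this effect more precisely by deriving the distribution of each step within a transformer layer.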
Related papers
- On the Effect of Pre-training for Transformer in Different Modality on
Offline Reinforcement Learning [0.0]
We investigate how pre-training on data of different modalities, such as language and vision, affects fine-tuning of Transformer-based models on MuJoCo offline reinforcement learning tasks.
arXiv Detail & Related papers (2022-11-17T13:34:08Z)
- Transformer Language Models without Positional Encodings Still Learn
Positional Information [45.42248458957122]
We find that transformer language models without any explicit positional encoding are still competitive with standard models.
We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position.
arXiv Detail & Related papers (2022-03-30T19:37:07Z)
- XAI for Transformers: Better Explanations through Conservative
Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z)
- Pathologies in priors and inference for Bayesian transformers [71.97183475225215]
To date, there have been no successful attempts to improve transformer models in terms of predictive uncertainty using Bayesian inference.
We find that weight-space inference in transformers does not work well, regardless of the approximate posterior.
We propose a novel method based on the implicit reparameterization of the Dirichlet distribution to apply variational inference directly to the attention weights.
arXiv Detail & Related papers (2021-10-08T10:35:27Z)
- The Case for Translation-Invariant Self-Attention in Transformer-Based
Language Models [11.148662334602639]
We analyze the position embeddings of existing language models and find strong evidence of translation invariance.
We propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion.
arXiv Detail & Related papers (2021-06-03T15:56:26Z)
- Transformer-Based Source-Free Domain Adaptation [134.67078085569017]
We study the task of source-free domain adaptation (SFDA), where the source data are not available during target adaptation.
We propose a generic and effective framework based on Transformer, named TransDA, for learning a generalized model for SFDA.
arXiv Detail & Related papers (2021-05-28T23:06:26Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence
Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)