Transformer Language Models without Positional Encodings Still Learn Positional Information
- URL: http://arxiv.org/abs/2203.16634v1
- Date: Wed, 30 Mar 2022 19:37:07 GMT
- Title: Transformer Language Models without Positional Encodings Still Learn Positional Information
- Authors: Adi Haviv, Ori Ram, Ofir Press, Peter Izsak and Omer Levy
- Abstract summary: We find that transformer language models without any explicit positional encoding are still competitive with standard models.
We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position.
- Score: 45.42248458957122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers typically require some form of positional encoding, such as
positional embeddings, to process natural language sequences. Surprisingly, we
find that transformer language models without any explicit positional encoding
are still competitive with standard models, and that this phenomenon is robust
across different datasets, model sizes, and sequence lengths. Probing
experiments reveal that such models acquire an implicit notion of absolute
positions throughout the network, effectively compensating for the missing
information. We conjecture that causal attention enables the model to infer the
number of predecessors that each token can attend to, thereby approximating its
absolute position.
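A minimal, self-contained numpy sketch of this conjecture follows (illustrative only, not the authors' probing setup; the sequence length, model dimension, and random values are arbitrary assumptions). With a causal mask and content-free queries and keys, position i spreads its attention uniformly over its i + 1 visible tokens, so both the per-token attention weight and the variance of the attention output decay monotonically with absolute position, giving later layers a signal that a probe could read out as position.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16  # arbitrary toy sizes

# Content-free attention: every unmasked logit is equal, so the only
# asymmetry comes from the causal mask itself.
scores = np.zeros((seq_len, seq_len))
scores[np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)] = -np.inf

# Row-wise softmax: position i places weight 1/(i+1) on each of its
# i+1 visible tokens (itself plus its i predecessors).
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights.max(axis=-1), 3))  # 1.0, 0.5, 0.333, ..., 0.125

# Averaging i+1 independent random value vectors shrinks the output
# variance roughly like 1/(i+1), so the output statistics alone encode
# absolute position (cf. the variance-shrinkage paper listed below).
values = rng.standard_normal((seq_len, d_model))
outputs = weights @ values
print(np.round(outputs.var(axis=-1), 3))
```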
Related papers
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z)
- The Impact of Positional Encodings on Multilingual Compression [3.454503173118508]
Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture.
We first show that surprisingly, while these modifications tend to improve monolingual language models, none of them result in better multilingual language models.
arXiv Detail & Related papers (2021-09-11T23:22:50Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of encoded results along position index by such a dynamical system.
arXiv Detail & Related papers (2020-03-13T00:41:41Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple, fixed (non-learnable) attentive patterns; a minimal sketch of one such pattern follows after this list.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
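As a companion to the last entry, here is a minimal numpy sketch (an illustrative assumption, not the cited paper's implementation) of one plausible fixed, non-learnable attention pattern: a head that always attends to the previous token, so it needs no query/key parameters at all.

```python
import numpy as np

def previous_token_pattern(seq_len: int) -> np.ndarray:
    """Fixed attention weights: each position puts all of its weight on its
    immediate predecessor; the first position attends to itself."""
    weights = np.zeros((seq_len, seq_len))
    weights[0, 0] = 1.0
    for i in range(1, seq_len):
        weights[i, i - 1] = 1.0
    return weights

# The head's output is a simple gather of the previous token's value vectors,
# with nothing to train for this head.
seq_len, d_model = 5, 8  # arbitrary toy sizes
values = np.random.default_rng(1).standard_normal((seq_len, d_model))
outputs = previous_token_pattern(seq_len) @ values
assert np.allclose(outputs[1:], values[:-1])  # positions 1..n-1 copy their predecessor
```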
This list is automatically generated from the titles and abstracts of the papers on this site.