Transformer Language Models without Positional Encodings Still Learn Positional Information
- URL: http://arxiv.org/abs/2203.16634v1
- Date: Wed, 30 Mar 2022 19:37:07 GMT
- Title: Transformer Language Models without Positional Encodings Still Learn Positional Information
- Authors: Adi Haviv, Ori Ram, Ofir Press, Peter Izsak and Omer Levy
- Abstract summary: We find that transformer language models without any explicit positional encoding are still competitive with standard models.
We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position.
- Score: 45.42248458957122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers typically require some form of positional encoding, such as
positional embeddings, to process natural language sequences. Surprisingly, we
find that transformer language models without any explicit positional encoding
are still competitive with standard models, and that this phenomenon is robust
across different datasets, model sizes, and sequence lengths. Probing
experiments reveal that such models acquire an implicit notion of absolute
positions throughout the network, effectively compensating for the missing
information. We conjecture that causal attention enables the model to infer the
number of predecessors that each token can attend to, thereby approximating its
absolute position.
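A minimal, self-contained numpy sketch of this conjecture follows (illustrative only, not the authors' probing setup; the sequence length, model dimension, and random values are arbitrary assumptions). With a causal mask and content-free queries and keys, position i spreads its attention uniformly over its i + 1 visible tokens, so both the per-token attention weight and the variance of the attention output decay monotonically with absolute position, giving later layers a signal that a probe could read out as position.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16  # arbitrary toy sizes

# Content-free attention: every unmasked logit is equal, so the only
# asymmetry comes from the causal mask itself.
scores = np.zeros((seq_len, seq_len))
scores[np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)] = -np.inf

# Row-wise softmax: position i places weight 1/(i+1) on each of its
# i+1 visible tokens (itself plus its i predecessors).
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights.max(axis=-1), 3))  # 1.0, 0.5, 0.333, ..., 0.125

# Averaging i+1 independent random value vectors shrinks the output
# variance roughly like 1/(i+1), so the output statistics alone encode
# absolute position (cf. the variance-shrinkage paper listed below).
values = rng.standard_normal((seq_len, d_model))
outputs = weights @ values
print(np.round(outputs.var(axis=-1), 3))
```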
Related papers
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z)
- The Impact of Positional Encodings on Multilingual Compression [3.454503173118508]
Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture.
We first show that surprisingly, while these modifications tend to improve monolingual language models, none of them result in better multilingual language models.
arXiv Detail & Related papers (2021-09-11T23:22:50Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of encoded results along position index by such a dynamical system.
arXiv Detail & Related papers (2020-03-13T00:41:41Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple, fixed (non-learnable) attentive patterns; a minimal sketch of one such pattern follows after this list.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
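As a companion to the last entry, here is a minimal numpy sketch (an illustrative assumption, not the cited paper's implementation) of one plausible fixed, non-learnable attention pattern: a head that always attends to the previous token, so it needs no query/key parameters at all.

```python
import numpy as np

def previous_token_pattern(seq_len: int) -> np.ndarray:
    """Fixed attention weights: each position puts all of its weight on its
    immediate predecessor; the first position attends to itself."""
    weights = np.zeros((seq_len, seq_len))
    weights[0, 0] = 1.0
    for i in range(1, seq_len):
        weights[i, i - 1] = 1.0
    return weights

# The head's output is a simple gather of the previous token's value vectors,
# with nothing to train for this head.
seq_len, d_model = 5, 8  # arbitrary toy sizes
values = np.random.default_rng(1).standard_normal((seq_len, d_model))
outputs = previous_token_pattern(seq_len) @ values
assert np.allclose(outputs[1:], values[:-1])  # positions 1..n-1 copy their predecessor
```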
This list is automatically generated from the titles and abstracts of the papers on this site.