Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings
- URL: http://arxiv.org/abs/2501.00073v1
- Date: Mon, 30 Dec 2024 03:35:41 GMT
- Title: Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings
- Authors: Chunsheng Zuo, Pavel Guerzhoy, Michael Guerzhoy
- Abstract summary: We propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding.
We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens.
- Score: 3.0559252110342703
- Abstract: Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur both in trained and in randomly initialized Transformer models with causal attention and no positional encodings, over a common range of hyperparameters.
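The hypothesis can be probed with a short experiment. Below is a minimal PyTorch sketch (not the authors' code; the vocabulary size, model dimensions, and sequence length are illustrative assumptions): feed a random token sequence through a randomly initialized causal Transformer with no positional encodings, then measure the mean cosine similarity of hidden states as a function of positional distance. Under the paper's hypothesis, similarity tends to decrease as the distance between positions grows.

```python
# Minimal sketch (not the authors' code): probe whether hidden states at nearby
# positions are more similar than those at faraway positions in a randomly
# initialized causal Transformer with NO positional encodings.
# All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, n_heads, n_layers, seq_len = 1000, 128, 4, 4, 64

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
    batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
encoder.eval()  # disable dropout for a clean measurement

# Causal (lower-triangular) attention mask; note that no positional encoding is added.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

tokens = torch.randint(0, vocab_size, (1, seq_len))
with torch.no_grad():
    hidden = encoder(embed(tokens), mask=causal_mask)  # (1, seq_len, d_model)

# Mean cosine similarity of hidden states as a function of positional distance.
h = nn.functional.normalize(hidden[0], dim=-1)  # (seq_len, d_model), unit-norm rows
sim = h @ h.T                                   # pairwise cosine similarities
for dist in (1, 2, 4, 8, 16, 32):
    print(f"distance {dist:2d}: mean cosine similarity = "
          f"{torch.diagonal(sim, offset=dist).mean():.3f}")
```

The paper reports this distance-dependent pattern over a common range of hyperparameters; the exact numbers printed by this sketch depend on the random initialization and sizes chosen here.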
Related papers
- Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding [32.01426831450348]
We show that causal masking and a starting token enable Transformers to compute positional information and depth within hierarchical structures.
We demonstrate that Transformers without positional encoding can generate hierarchical languages.
arXiv Detail & Related papers (2024-10-16T09:56:01Z)
- Improving Transformers using Faithful Positional Encoding [55.30212768657544]
We propose a new positional encoding method for a neural network architecture called the Transformer.
Unlike the standard sinusoidal positional encoding, our approach has a guarantee of not losing information about the positional order of the input sequence.
arXiv Detail & Related papers (2024-05-15T03:17:30Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, but these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
- Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z)
- Transformer Language Models without Positional Encodings Still Learn Positional Information [45.42248458957122]
We find that transformer language models without any explicit positional encoding are still competitive with standard models.
We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position.
arXiv Detail & Related papers (2022-03-30T19:37:07Z)
- Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
arXiv Detail & Related papers (2021-06-05T04:40:18Z)
- Do We Really Need Explicit Position Encodings for Vision Transformers? [29.7662570764424]
We propose a conditional position encoding scheme, which is conditioned on the local neighborhood of the input token.
Our new model with PEG, named Conditional Positional encoding Vision Transformer (CPVT), can naturally process input sequences of arbitrary length.
We demonstrate that CPVT can result in visually similar attention maps and even better performance than those with predefined positional encodings.
arXiv Detail & Related papers (2021-02-22T10:29:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.