The Curious Case of Absolute Position Embeddings
- URL: http://arxiv.org/abs/2210.12574v1
- Date: Sun, 23 Oct 2022 00:00:04 GMT
- Title: The Curious Case of Absolute Position Embeddings
- Authors: Koustuv Sinha, Amirhossein Kazemnejad, Siva Reddy, Joelle Pineau,
Dieuwke Hupkes, Adina Williams
- Abstract summary: Transformer language models encode the notion of word order using positional information.
In natural language, it is not absolute position that matters, but relative position, and the extent to which APEs can capture this type of information has not been investigated.
We observe that models trained with APE over-rely on positional information, to the point that they break down when subjected to sentences with shifted position information.
- Score: 65.13827063579728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer language models encode the notion of word order using positional
information. Most commonly, this positional information is represented by
absolute position embeddings (APEs), that are learned from the pretraining
data. However, in natural language, it is not absolute position that matters,
but relative position, and the extent to which APEs can capture this type of
information has not been investigated. In this work, we observe that models
trained with APE over-rely on positional information to the point that they
break down when subjected to sentences with shifted position information.
Specifically, when models are subjected to sentences starting from a non-zero
position (excluding the effect of priming), they exhibit noticeably degraded
performance on zero- to full-shot tasks, across a range of model families and
model sizes. Our findings raise questions about the efficacy of APEs to model
the relativity of position information, and invite further introspection on the
sentence and word order processing strategies employed by these models.
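The shift experiment described in the abstract can be illustrated with a minimal sketch: with a learned absolute position embedding table, the same sentence placed at a non-zero starting position produces a different input representation, so a model that over-relies on absolute positions sees an out-of-distribution input. The table and offsets below are toy stand-ins, not the paper's actual models or evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d = 512, 16

# Toy learned absolute position embedding (APE) table, standing in for a
# pretrained one; rows are positions, columns are hidden dimensions.
ape = rng.normal(size=(max_len, d))
tok = rng.normal(size=(10, d))  # content embeddings for a 10-token sentence

def embed(tokens, start=0):
    """Add absolute position embeddings, optionally starting at a non-zero offset."""
    pos = np.arange(start, start + len(tokens))
    return tokens + ape[pos]

x0 = embed(tok, start=0)      # positions 0..9, as seen during pretraining
x100 = embed(tok, start=100)  # the same sentence, shifted to positions 100..109

# The token content is identical, but the representations the model sees differ:
drift = np.abs(x0 - x100).mean()
print(drift > 0)  # True: inputs under APE are not translation-invariant
```

A relative scheme would make `x0` and `x100` indistinguishable for this sentence, which is the contrast the paper's findings turn on.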
Related papers
- Eliminating Position Bias of Language Models: A Mechanistic Approach [119.34143323054143]
Position bias has proven to be a prevalent issue of modern language models (LMs)
We find that causal attention generally causes models to favor distant content, while relative positional encodings like RoPE prefer nearby ones.
We propose to eliminate position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a training-free, zero-shot manner.
arXiv Detail & Related papers (2024-07-01T09:06:57Z)
- Mitigate Position Bias in Large Language Models via Scaling a Single Dimension [47.792435921037274]
This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias.
It further identifies that, in addition to position embeddings, causal attention mask also contributes to position bias by creating position-specific hidden states.
Based on these insights, we propose a method to mitigate position bias by scaling these positional hidden states.
arXiv Detail & Related papers (2024-06-04T17:55:38Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z)
- Unveiling The Mask of Position-Information Pattern Through the Mist of Image Features [75.62755703738696]
Recent studies show that paddings in convolutional neural networks encode absolute position information.
Existing metrics for quantifying the strength of positional information remain unreliable.
We propose novel metrics for measuring (and visualizing) the encoded positional information.
arXiv Detail & Related papers (2022-06-02T17:59:57Z)
- Transformer Language Models without Positional Encodings Still Learn Positional Information [45.42248458957122]
We find that transformer language models without any explicit positional encoding are still competitive with standard models.
We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position.
arXiv Detail & Related papers (2022-03-30T19:37:07Z)
- The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models [11.148662334602639]
We analyze the position embeddings of existing language models and find strong evidence of translation invariance.
We propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion.
arXiv Detail & Related papers (2021-06-03T15:56:26Z)
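Two of the related papers above suggest that position can be inferred even without positional embeddings: one via shrinking self-attention variance, the other via causal attention counting each token's predecessors. A minimal sketch of the shared intuition, assuming uniform causal attention over i.i.d. value vectors (an illustrative simplification, not any of these papers' actual models): position i averages i+1 values, so output variance decays roughly as 1/(i+1), leaking absolute position.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 64, 256  # sequence length, hidden size

v = rng.normal(size=(n, d))  # i.i.d. unit-variance value vectors

# Uniform causal attention: the output at position i is the mean of values 0..i.
out = np.cumsum(v, axis=0) / np.arange(1, n + 1)[:, None]

# Per-position variance across the hidden dimension shrinks roughly as 1/(i+1),
# so absolute position is recoverable from the output statistics alone.
var = out.var(axis=1)
print(var[0] > var[8] > var[32])
```

This is why causal attention alone can act as an implicit positional signal, consistent with the conjecture in the "Transformer Language Models without Positional Encodings" entry above.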
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.