Improve Transformer Models with Better Relative Position Embeddings
- URL: http://arxiv.org/abs/2009.13658v1
- Date: Mon, 28 Sep 2020 22:18:58 GMT
- Title: Improve Transformer Models with Better Relative Position Embeddings
- Authors: Zhiheng Huang, Davis Liang, Peng Xu, Bing Xiang
- Abstract summary: Transformer architectures rely on explicit position encodings to preserve a notion of word order.
We argue that existing work does not fully utilize position information.
We propose new techniques that encourage increased interaction between query, key and relative position embeddings.
- Score: 18.59434691153783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer architectures rely on explicit position encodings in order to
preserve a notion of word order. In this paper, we argue that existing work
does not fully utilize position information. For example, the initial proposal
of a sinusoid embedding is fixed and not learnable. In this paper, we first
review absolute position embeddings and existing methods for relative position
embeddings. We then propose new techniques that encourage increased interaction
between query, key and relative position embeddings in the self-attention
mechanism. Our most promising approach is a generalization of the absolute
position embedding, improving results on SQuAD1.1 compared to previous position
embedding approaches. In addition, we address the inductive property of
whether a position embedding can be robust enough to handle long sequences. We
demonstrate empirically that our relative position embedding method is
reasonably generalized and robust from the inductive perspective. Finally, we
show that our proposed method can be adopted as a near drop-in replacement for
improving the accuracy of large models with a small computational budget.
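As a rough illustration of the idea of letting relative position embeddings interact with both queries and keys, the sketch below adds query-relative and key-relative terms to the usual content term of the attention scores and clips relative distances so one embedding table can cover sequences longer than those seen in training. The module name, the clipping scheme, and parameters such as `max_rel_dist` are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's verbatim method): self-attention where a
# learned relative position embedding r_ij interacts with both the query and
# the key, i.e. e_ij ~ q_i.k_j + q_i.r_ij + k_j.r_ij. Relative distances are
# clipped to max_rel_dist so the same table can be reused for longer sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosSelfAttention(nn.Module):
    def __init__(self, d_model: int, max_rel_dist: int = 128):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.max_rel_dist = max_rel_dist
        # One embedding per clipped relative distance in [-max_rel_dist, max_rel_dist].
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Clipped relative distances j - i, shifted to non-negative indices.
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        r = self.rel_emb(rel + self.max_rel_dist)            # (n, n, d)

        content = torch.einsum("bid,bjd->bij", q, k)         # q_i . k_j
        q_rel = torch.einsum("bid,ijd->bij", q, r)           # q_i . r_ij
        k_rel = torch.einsum("bjd,ijd->bij", k, r)           # k_j . r_ij

        scores = (content + q_rel + k_rel) / d ** 0.5
        attn = F.softmax(scores, dim=-1)
        return torch.einsum("bij,bjd->bid", attn, v)

# Usage: attend over a random batch of embeddings.
layer = RelPosSelfAttention(d_model=64)
out = layer(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```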
Related papers
- Eliminating Position Bias of Language Models: A Mechanistic Approach [119.34143323054143]
Position bias has proven to be a prevalent issue in modern language models (LMs).
Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings.
By eliminating position bias, models achieve better performance and reliability in downstream tasks, including LM-as-a-judge, retrieval-augmented QA, molecule generation, and math reasoning.
arXiv Detail & Related papers (2024-07-01T09:06:57Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- A Frustratingly Easy Improvement for Position Embeddings via Random Padding [68.75670223005716]
In this paper, we propose a simple but effective strategy, Random Padding, without any modifications to existing pre-trained language models.
Experiments show that Random Padding can significantly improve model performance on the instances whose answers are located at rear positions.
arXiv Detail & Related papers (2023-05-08T17:08:14Z)
- The Curious Case of Absolute Position Embeddings [65.13827063579728]
Transformer language models encode the notion of word order using positional information.
In natural language, it is not absolute position that matters, but relative position, and the extent to which APEs can capture this type of information has not been investigated.
We observe that models trained with APE over-rely on positional information to the point that they break down when subjected to sentences with shifted position information.
arXiv Detail & Related papers (2022-10-23T00:00:04Z)
- Learning Positional Embeddings for Coordinate-MLPs [37.56813817513575]
We develop a generic framework to learn the positional embedding based on the classic graph-Laplacian regularization.
We show that the proposed embedding achieves better performance with higher stability compared to the well-established random Fourier features.
arXiv Detail & Related papers (2021-12-21T23:23:33Z)
- Multiplicative Position-aware Transformer Models for Language Understanding [17.476450946279037]
Transformer models, which leverage architectural improvements like self-attention, perform remarkably well on Natural Language Processing (NLP) tasks.
In this paper, we review major existing position embedding methods and compare their accuracy on downstream NLP tasks.
We also propose a novel multiplicative embedding method which leads to superior accuracy when compared to existing methods.
arXiv Detail & Related papers (2021-09-27T04:18:32Z)
- CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings [33.87449556591022]
We propose an augmentation-based approach (CAPE) for absolute positional embeddings.
CAPE keeps the advantages of both absolute (simplicity and speed) and relative position embeddings (better generalization); a minimal sketch of this augmentation idea follows at the end of this list.
arXiv Detail & Related papers (2021-06-06T14:54:55Z)
- The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models [11.148662334602639]
We analyze the position embeddings of existing language models and find strong evidence of translation invariance.
We propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion.
arXiv Detail & Related papers (2021-06-03T15:56:26Z)
- TFPose: Direct Human Pose Estimation with Transformers [83.03424247905869]
We formulate the pose estimation task into a sequence prediction problem that can effectively be solved by transformers.
Our framework is simple and direct, bypassing the drawbacks of the heatmap-based pose estimation.
Experiments on the MS-COCO and MPII datasets demonstrate that our method can significantly improve the state-of-the-art of regression-based pose estimation.
arXiv Detail & Related papers (2021-03-29T04:18:54Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
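Returning to the CAPE entry above, the sketch below illustrates the general augmentation idea of perturbing continuous absolute positions (a global shift, per-token local shifts, and a global rescaling) before encoding them with fixed sinusoidal features. The function names and perturbation ranges here are assumptions for illustration, not CAPE's published defaults.

```python
# Illustrative sketch of a CAPE-style augmentation (assumed ranges, not the
# paper's defaults): continuous positions are randomly shifted and scaled
# during training, then encoded with standard sinusoidal features, so the
# model cannot memorize exact absolute offsets.
import math
import torch

def sinusoidal(pos: torch.Tensor, d_model: int) -> torch.Tensor:
    # pos: (batch, seq_len) continuous positions -> (batch, seq_len, d_model)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    freq = torch.exp(-math.log(10000.0) * i / d_model)              # (d_model/2,)
    angles = pos.unsqueeze(-1) * freq                                # (b, n, d/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def cape_positions(batch: int, seq_len: int,
                   global_shift: float = 5.0,
                   local_shift: float = 0.5,
                   global_scale_log: float = 0.3) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).expand(batch, seq_len).clone()
    pos = pos + (torch.rand(batch, 1) * 2 - 1) * global_shift           # global shift
    pos = pos + (torch.rand(batch, seq_len) * 2 - 1) * local_shift      # local shifts
    pos = pos * torch.exp((torch.rand(batch, 1) * 2 - 1) * global_scale_log)  # scaling
    return pos

emb = sinusoidal(cape_positions(2, 16), d_model=64)
print(emb.shape)  # torch.Size([2, 16, 64])
```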
This list is automatically generated from the titles and abstracts of the papers on this site.