CAPE: Encoding Relative Positions with Continuous Augmented Positional
Embeddings
- URL: http://arxiv.org/abs/2106.03143v1
- Date: Sun, 6 Jun 2021 14:54:55 GMT
- Title: CAPE: Encoding Relative Positions with Continuous Augmented Positional
Embeddings
- Authors: Tatiana Likhomanenko, Qiantong Xu, Ronan Collobert, Gabriel Synnaeve,
Alex Rogozhnikov
- Abstract summary: We propose an augmentation-based approach (CAPE) for absolute positional embeddings.
CAPE keeps the advantages of both absolute (simplicity and speed) and relative position embeddings (better generalization).
- Score: 33.87449556591022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Without positional information, attention-based transformer neural networks
are permutation-invariant. Absolute or relative positional embeddings are the
most popular ways to feed transformer models positional information. Absolute
positional embeddings are simple to implement, but suffer from generalization
issues when evaluated on sequences of a different length than those seen at
training time. Relative positions are more robust to length change, but are
more complex to implement and yield inferior model throughput. In this paper,
we propose an augmentation-based approach (CAPE) for absolute positional
embeddings, which keeps the advantages of both absolute (simplicity and speed)
and relative position embeddings (better generalization). In addition, our
empirical evaluation on state-of-the-art models in machine translation, image
and speech recognition demonstrates that CAPE leads to better generalization
performance as well as increased stability with respect to training
hyper-parameters.
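As a concrete illustration of the augmentation idea, below is a minimal sketch of CAPE-style training-time augmentation of sinusoidal positions: a global shift, a per-position local jitter, and a global scaling are applied to the continuous positions before the embedding is computed, while plain positions are used at evaluation time. The specific augmentations shown and the parameter ranges are illustrative choices for this sketch, not the paper's exact recipe.

```python
# Minimal sketch of CAPE-style augmentation of absolute (sinusoidal) positions.
# The three augmentations (global shift, local jitter, global scaling) and their
# default ranges are illustrative, not the paper's exact settings.
import numpy as np

def sinusoidal_embedding(positions, dim):
    """Standard sinusoidal embedding evaluated at (possibly non-integer) positions."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = positions[:, None] * inv_freq[None, :]          # (seq_len, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def cape_positions(seq_len, train=True, max_global_shift=5.0,
                   max_local_shift=0.5, max_global_scale=1.03, rng=None):
    """Return augmented continuous positions for one sequence."""
    rng = rng or np.random.default_rng()
    pos = np.arange(seq_len, dtype=np.float64)
    if train:
        pos = pos + rng.uniform(-max_global_shift, max_global_shift)        # global shift
        pos = pos + rng.uniform(-max_local_shift, max_local_shift, seq_len) # local jitter
        log_scale = np.log(max_global_scale)
        pos = pos * np.exp(rng.uniform(-log_scale, log_scale))              # global scaling
    return pos

# Usage: the embedding is added to token features exactly as with plain absolute PE.
emb = sinusoidal_embedding(cape_positions(100, train=True), dim=256)
```

Because the augmented embedding is still simply added to the token features, the model keeps the implementation simplicity and throughput of plain absolute embeddings.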
Related papers
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose FIRE, a novel functional relative position encoding with progressive interpolation, to improve Transformer generalization to longer contexts (a rough sketch of the general form follows this entry).
We theoretically prove that FIRE can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
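For orientation, here is a rough sketch of what a FIRE-style learned relative bias can look like: the relative distance is passed through a log transform and normalized by the (thresholded) query position before a tiny MLP produces the bias added to the attention logits. The specific transform, normalizer, and MLP shape are assumptions based on the summary above rather than the paper's exact parameterization.

```python
# Rough sketch of a FIRE-style learned relative-position bias for causal attention.
# The log transform psi, the progressive normalization by the query position, and the
# tiny MLP f_theta are assumptions based on the summary above, not a verbatim
# reproduction of the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
hidden, c = 32, 1.0
W1, b1 = rng.normal(size=(1, hidden)), np.zeros(hidden)    # f_theta: 1 -> hidden -> 1 MLP
W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)

def psi(x):
    return np.log(c * np.abs(x) + 1.0)                      # monotone log transform

def fire_bias(seq_len, L=16.0):
    i = np.arange(seq_len)[:, None]                         # query positions
    j = np.arange(seq_len)[None, :]                         # key positions
    # Progressive interpolation: distances are normalized by the (thresholded) query
    # position, so inputs to f_theta stay in a bounded range at any sequence length.
    x = psi(i - j) / psi(np.maximum(i, L))
    h = np.tanh(x[..., None] @ W1 + b1)                     # (seq, seq, hidden)
    bias = (h @ W2 + b2)[..., 0]                            # (seq, seq)
    return np.where(j <= i, bias, -np.inf)                  # causal mask

# The bias is added to the attention logits before the softmax.
logits_bias = fire_bias(8)
```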
- The Impact of Positional Encoding on Length Generalization in Transformers [50.48278691801413]
We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches.
Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
arXiv Detail & Related papers (2023-05-31T00:29:55Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- The Curious Case of Absolute Position Embeddings [65.13827063579728]
Transformer language models encode the notion of word order using positional information.
In natural language, it is not absolute position that matters, but relative position, and the extent to which APEs can capture this type of information has not been investigated.
We observe that models trained with APE over-rely on positional information, to the point that they break down when subjected to sentences with shifted position information.
arXiv Detail & Related papers (2022-10-23T00:00:04Z)
- Multiplicative Position-aware Transformer Models for Language Understanding [17.476450946279037]
Transformer models, which leverage architectural improvements like self-attention, perform remarkably well on Natural Language Processing (NLP) tasks.
In this paper, we review major existing position embedding methods and compare their accuracy on downstream NLP tasks.
We also propose a novel multiplicative embedding method which leads to superior accuracy when compared to existing methods.
arXiv Detail & Related papers (2021-09-27T04:18:32Z)
- SHAPE: Shifted Absolute Position Embedding for Transformers [59.03597635990196]
Existing position representations suffer from a lack of generalization to test data with unseen lengths or from a high computational cost.
We investigate shifted absolute position embedding (SHAPE) to address both issues (a brief sketch of the idea follows this entry).
arXiv Detail & Related papers (2021-09-13T00:10:02Z)
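A minimal sketch of the SHAPE idea, assuming the standard setup of integer position indices fed to an absolute position embedding: during training, every index in a sequence is shifted by a single random offset, so the model cannot exploit absolute values; the offset range used here is illustrative.

```python
# Minimal sketch of SHAPE: at training time every position index in a sequence is
# shifted by one random offset shared across the sequence, which discourages the model
# from relying on absolute values. The offset range max_offset is illustrative.
import numpy as np

def shape_position_ids(seq_len, train=True, max_offset=100, rng=None):
    rng = rng or np.random.default_rng()
    offset = rng.integers(0, max_offset + 1) if train else 0   # one offset per sequence
    return np.arange(seq_len) + offset                          # fed to the usual APE lookup

position_ids = shape_position_ids(128, train=True)
```

Note the family resemblance to CAPE's global shift; SHAPE operates on integer indices looked up in an ordinary embedding table.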
- Conformer-based End-to-end Speech Recognition With Rotary Position Embedding [11.428057887454008]
We introduce rotary position embedding (RoPE) in the convolution-augmented transformer (conformer).
RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into the self-attention module (a sketch of the rotation follows this entry).
Our model achieves relative word error rate reductions of 8.70% and 7.27% over the conformer on the test-clean and test-other sets of the LibriSpeech corpus, respectively.
arXiv Detail & Related papers (2021-07-13T08:07:22Z)
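Since the entry above is summarized very briefly, here is a minimal, self-contained sketch of the rotary mechanism it refers to: pairs of query/key dimensions are rotated by position-dependent angles, which makes the attention dot product depend only on relative offsets. The frequencies and shapes follow the common RoPE convention; this is not the conformer-specific recipe from the paper.

```python
# Minimal sketch of rotary position embedding (RoPE) applied to queries/keys: pairs of
# feature dimensions are rotated by a position-dependent angle, so the dot product
# q_m . k_n depends only on the relative offset m - n.
import numpy as np

def apply_rope(x):
    """x: (seq_len, dim) with even dim; returns position-rotated features."""
    seq_len, dim = x.shape
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                             # paired dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                          # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before the attention dot product; values are left as-is.
q = apply_rope(np.random.default_rng(0).normal(size=(16, 64)))
```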
- Improve Transformer Models with Better Relative Position Embeddings [18.59434691153783]
Transformer architectures rely on explicit position encodings to preserve a notion of word order.
We argue that existing work does not fully utilize position information.
We propose new techniques that encourage increased interaction between query, key and relative position embeddings.
arXiv Detail & Related papers (2020-09-28T22:18:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.