CAPE: Encoding Relative Positions with Continuous Augmented Positional
Embeddings
- URL: http://arxiv.org/abs/2106.03143v1
- Date: Sun, 6 Jun 2021 14:54:55 GMT
- Title: CAPE: Encoding Relative Positions with Continuous Augmented Positional
Embeddings
- Authors: Tatiana Likhomanenko, Qiantong Xu, Ronan Collobert, Gabriel Synnaeve,
Alex Rogozhnikov
- Abstract summary: We propose an augmentation-based approach (CAPE) for absolute positional embeddings.
CAPE keeps the advantages of both absolute (simplicity and speed) and relative position embeddings (better generalization).
- Score: 33.87449556591022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Without positional information, attention-based transformer neural networks
are permutation-invariant. Absolute or relative positional embeddings are the
most popular ways to feed transformer models positional information. Absolute
positional embeddings are simple to implement, but suffer from generalization
issues when evaluated on sequences of a different length than those seen at
training time. Relative positions are more robust to length change, but are
more complex to implement and yield inferior model throughput. In this paper,
we propose an augmentation-based approach (CAPE) for absolute positional
embeddings, which keeps the advantages of both absolute (simplicity and speed)
and relative position embeddings (better generalization). In addition, our
empirical evaluation on state-of-the-art models in machine translation, image
and speech recognition demonstrates that CAPE leads to better generalization
performance as well as increased stability with respect to training
hyper-parameters.
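As a concrete illustration of the augmentation idea, below is a minimal sketch of CAPE-style training-time augmentation of sinusoidal positions: a global shift, a per-position local jitter, and a global scaling are applied to the continuous positions before the embedding is computed, while plain positions are used at evaluation time. The specific augmentations shown and the parameter ranges are illustrative choices for this sketch, not the paper's exact recipe.

```python
# Minimal sketch of CAPE-style augmentation of absolute (sinusoidal) positions.
# The three augmentations (global shift, local jitter, global scaling) and their
# default ranges are illustrative, not the paper's exact settings.
import numpy as np

def sinusoidal_embedding(positions, dim):
    """Standard sinusoidal embedding evaluated at (possibly non-integer) positions."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = positions[:, None] * inv_freq[None, :]          # (seq_len, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def cape_positions(seq_len, train=True, max_global_shift=5.0,
                   max_local_shift=0.5, max_global_scale=1.03, rng=None):
    """Return augmented continuous positions for one sequence."""
    rng = rng or np.random.default_rng()
    pos = np.arange(seq_len, dtype=np.float64)
    if train:
        pos = pos + rng.uniform(-max_global_shift, max_global_shift)        # global shift
        pos = pos + rng.uniform(-max_local_shift, max_local_shift, seq_len) # local jitter
        log_scale = np.log(max_global_scale)
        pos = pos * np.exp(rng.uniform(-log_scale, log_scale))              # global scaling
    return pos

# Usage: the embedding is added to token features exactly as with plain absolute PE.
emb = sinusoidal_embedding(cape_positions(100, train=True), dim=256)
```

Because the augmented embedding is still simply added to the token features, the model keeps the implementation simplicity and throughput of plain absolute embeddings.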
Related papers
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose FIRE, a novel functional relative position encoding with progressive interpolation, to improve Transformer generalization to longer contexts (a rough sketch of the general form follows this entry).
We theoretically prove that FIRE can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
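For orientation, here is a rough sketch of what a FIRE-style learned relative bias can look like: the relative distance is passed through a log transform and normalized by the (thresholded) query position before a tiny MLP produces the bias added to the attention logits. The specific transform, normalizer, and MLP shape are assumptions based on the summary above rather than the paper's exact parameterization.

```python
# Rough sketch of a FIRE-style learned relative-position bias for causal attention.
# The log transform psi, the progressive normalization by the query position, and the
# tiny MLP f_theta are assumptions based on the summary above, not a verbatim
# reproduction of the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
hidden, c = 32, 1.0
W1, b1 = rng.normal(size=(1, hidden)), np.zeros(hidden)    # f_theta: 1 -> hidden -> 1 MLP
W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)

def psi(x):
    return np.log(c * np.abs(x) + 1.0)                      # monotone log transform

def fire_bias(seq_len, L=16.0):
    i = np.arange(seq_len)[:, None]                         # query positions
    j = np.arange(seq_len)[None, :]                         # key positions
    # Progressive interpolation: distances are normalized by the (thresholded) query
    # position, so inputs to f_theta stay in a bounded range at any sequence length.
    x = psi(i - j) / psi(np.maximum(i, L))
    h = np.tanh(x[..., None] @ W1 + b1)                     # (seq, seq, hidden)
    bias = (h @ W2 + b2)[..., 0]                            # (seq, seq)
    return np.where(j <= i, bias, -np.inf)                  # causal mask

# The bias is added to the attention logits before the softmax.
logits_bias = fire_bias(8)
```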
- The Impact of Positional Encoding on Length Generalization in Transformers [50.48278691801413]
We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches.
Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
arXiv Detail & Related papers (2023-05-31T00:29:55Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- The Curious Case of Absolute Position Embeddings [65.13827063579728]
Transformer language models encode the notion of word order using positional information.
In natural language, it is not absolute position that matters, but relative position, and the extent to which APEs can capture this type of information has not been investigated.
We observe that models trained with APE over-rely on positional information, to the point that they break down when subjected to sentences with shifted position information.
arXiv Detail & Related papers (2022-10-23T00:00:04Z)
- Multiplicative Position-aware Transformer Models for Language Understanding [17.476450946279037]
Transformer models, which leverage architectural improvements like self-attention, perform remarkably well on Natural Language Processing (NLP) tasks.
In this paper, we review major existing position embedding methods and compare their accuracy on downstream NLP tasks.
We also propose a novel multiplicative embedding method which leads to superior accuracy when compared to existing methods.
arXiv Detail & Related papers (2021-09-27T04:18:32Z)
- SHAPE: Shifted Absolute Position Embedding for Transformers [59.03597635990196]
Existing position representations suffer from a lack of generalization to test data with unseen lengths or from a high computational cost.
We investigate shifted absolute position embedding (SHAPE) to address both issues (a brief sketch of the idea follows this entry).
arXiv Detail & Related papers (2021-09-13T00:10:02Z)
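A minimal sketch of the SHAPE idea, assuming the standard setup of integer position indices fed to an absolute position embedding: during training, every index in a sequence is shifted by a single random offset, so the model cannot exploit absolute values; the offset range used here is illustrative.

```python
# Minimal sketch of SHAPE: at training time every position index in a sequence is
# shifted by one random offset shared across the sequence, which discourages the model
# from relying on absolute values. The offset range max_offset is illustrative.
import numpy as np

def shape_position_ids(seq_len, train=True, max_offset=100, rng=None):
    rng = rng or np.random.default_rng()
    offset = rng.integers(0, max_offset + 1) if train else 0   # one offset per sequence
    return np.arange(seq_len) + offset                          # fed to the usual APE lookup

position_ids = shape_position_ids(128, train=True)
```

Note the family resemblance to CAPE's global shift; SHAPE operates on integer indices looked up in an ordinary embedding table.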
- Conformer-based End-to-end Speech Recognition With Rotary Position Embedding [11.428057887454008]
We introduce rotary position embedding (RoPE) in the convolution-augmented transformer (conformer).
RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into the self-attention module (a sketch of the rotation follows this entry).
Our model achieves relative word error rate reductions of 8.70% and 7.27% over the conformer on the test-clean and test-other sets of the LibriSpeech corpus, respectively.
arXiv Detail & Related papers (2021-07-13T08:07:22Z)
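Since the entry above is summarized very briefly, here is a minimal, self-contained sketch of the rotary mechanism it refers to: pairs of query/key dimensions are rotated by position-dependent angles, which makes the attention dot product depend only on relative offsets. The frequencies and shapes follow the common RoPE convention; this is not the conformer-specific recipe from the paper.

```python
# Minimal sketch of rotary position embedding (RoPE) applied to queries/keys: pairs of
# feature dimensions are rotated by a position-dependent angle, so the dot product
# q_m . k_n depends only on the relative offset m - n.
import numpy as np

def apply_rope(x):
    """x: (seq_len, dim) with even dim; returns position-rotated features."""
    seq_len, dim = x.shape
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                             # paired dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                          # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before the attention dot product; values are left as-is.
q = apply_rope(np.random.default_rng(0).normal(size=(16, 64)))
```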
- Improve Transformer Models with Better Relative Position Embeddings [18.59434691153783]
Transformer architectures rely on explicit position encodings to preserve a notion of word order.
We argue that existing work does not fully utilize position information.
We propose new techniques that encourage increased interaction between query, key and relative position embeddings.
arXiv Detail & Related papers (2020-09-28T22:18:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.