Demystifying the Better Performance of Position Encoding Variants for
Transformer
- URL: http://arxiv.org/abs/2104.08698v1
- Date: Sun, 18 Apr 2021 03:44:57 GMT
- Title: Demystifying the Better Performance of Position Encoding Variants for
Transformer
- Authors: Pu-Chin Chen, Henry Tsai, Srinadh Bhojanapalli, Hyung Won Chung,
Yin-Wen Chang, Chun-Sung Ferng
- Abstract summary: We show how to encode position and segment into Transformer models.
The proposed method performs on par with SOTA on GLUE, XTREME and WMT benchmarks while saving costs.
- Score: 12.503079503907989
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are state of the art models in NLP that map a given input
sequence of vectors to an output sequence of vectors. However, these models are
permutation equivariant, and additive position embeddings are applied to the input
to supply information about the order of the input tokens. Further, for
some tasks, additional additive segment embeddings are used to denote different
types of input sentences. Recent works have proposed variations of positional
encodings, with relative position encodings achieving better performance. In
this work, we conduct a systematic study comparing different position encodings and
analyzing the reasons for the differences in their performance. We demonstrate
a simple yet effective way to encode position and segment into the Transformer
models. The proposed method performs on par with SOTA on GLUE, XTREME and WMT
benchmarks while saving computation costs.
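As a concrete illustration of the additive input scheme described in the abstract (token, position, and segment embeddings summed before the first Transformer layer), here is a minimal sketch; the table sizes and the random lookup tables are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's configuration).
vocab_size, max_len, num_segments, d_model = 1000, 128, 2, 64
rng = np.random.default_rng(0)

# Learned lookup tables in a real model; random matrices here for the sketch.
token_emb = rng.normal(size=(vocab_size, d_model))
pos_emb = rng.normal(size=(max_len, d_model))
seg_emb = rng.normal(size=(num_segments, d_model))

def embed(token_ids, segment_ids):
    """Sum token, absolute-position, and segment embeddings at each position."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]

# Example: a 6-token input with sentence A as segment 0 and sentence B as segment 1.
x = embed(np.array([7, 42, 98, 3, 55, 3]),
          np.array([0, 0, 0, 0, 1, 1]))
print(x.shape)  # (6, 64)
```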
Related papers
- Improving Transformers using Faithful Positional Encoding [55.30212768657544]
We propose a new positional encoding method for a neural network architecture called the Transformer.
Unlike the standard sinusoidal positional encoding, our approach has a guarantee of not losing information about the positional order of the input sequence.
arXiv Detail & Related papers (2024-05-15T03:17:30Z)
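For reference, the "standard sinusoidal positional encoding" that this entry contrasts against is the fixed encoding from the original Transformer, sketched below (an even d_model is an assumption of the sketch).

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal encoding from 'Attention Is All You Need':
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    Assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```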
- Comparing Graph Transformers via Positional Encodings [11.5844121984212]
The distinguishing power of graph transformers is closely tied to the choice of positional encoding.
There are two primary types of positional encoding: absolute positional encodings (APEs) and relative positional encodings (RPEs).
We show that graph transformers using APEs and RPEs are equivalent in terms of distinguishing power.
arXiv Detail & Related papers (2024-02-22T01:07:48Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose FIRE, a novel functional relative position encoding with progressive interpolation, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
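Relative position encodings such as T5's RPE and ALiBi, named above, typically enter the model as a bias added to the attention logits rather than to the input embeddings. The sketch below shows the well-known ALiBi form as one concrete instance of that idea; it is an illustration, not FIRE's learned functional interpolation.

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """ALiBi-style additive attention bias: -slope_h * (i - j) for j <= i.
    Head slopes follow the geometric sequence from the ALiBi paper
    (assuming num_heads is a power of two for simplicity)."""
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = i - j                              # query-key distance, >= 0 below the diagonal
    bias = -slopes[:, None, None] * distance      # (num_heads, seq_len, seq_len)
    return np.where(j <= i, bias, -np.inf)        # causal mask: no attending to the future

bias = alibi_bias(seq_len=16, num_heads=8)        # (8, 16, 16)
# In attention, this bias is added to q @ k.T / sqrt(d) before the softmax.
print(bias.shape)
```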
- Improving Position Encoding of Transformers for Multivariate Time Series Classification [5.467400475482668]
We propose a new absolute position encoding method dedicated to time series data, called time Absolute Position Encoding (tAPE).
We then propose ConvTran, a novel multivariate time series classification (MTSC) model that combines tAPE/eRPE with convolution-based input encoding to improve the position and data embeddings of time series data.
arXiv Detail & Related papers (2023-05-26T05:30:04Z)
- Towards More Efficient Insertion Transformer with Fractional Positional Encoding [44.45401243989363]
Auto-regressive neural sequence models have been shown to be effective across text generation tasks.
However, their left-to-right decoding order prevents generation from being parallelized.
The Insertion Transformer is an attractive alternative that allows outputting multiple tokens in a single generation step.
arXiv Detail & Related papers (2021-12-12T18:38:27Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Do We Really Need Explicit Position Encodings for Vision Transformers? [29.7662570764424]
We propose a conditional position encoding scheme, which is conditioned on the local neighborhood of the input token.
Our new model, built on a Positional Encoding Generator (PEG), is named Conditional Position encoding Vision Transformer (CPVT) and can naturally process input sequences of arbitrary length.
We demonstrate that CPVT can result in visually similar attention maps and even better performance than those with predefined positional encodings.
arXiv Detail & Related papers (2021-02-22T10:29:55Z)
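In CPVT the conditional position encoding is produced by the Positional Encoding Generator (PEG); one common reading is a depth-wise convolution over the 2D token grid whose zero padding exposes border positions. The sketch below follows that reading; the 3x3 kernel, the tensor shapes, and the residual addition are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Positional Encoding Generator sketch: a depth-wise 3x3 convolution over the
    2D token grid, added back to the tokens as a conditional position encoding.
    Zero padding is what lets the convolution infer positions near the borders."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim) -- patch tokens without any class token.
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)   # back to a 2D feature map
        pos = self.proj(grid).flatten(2).transpose(1, 2)    # (batch, h*w, dim)
        return tokens + pos                                  # conditional position encoding

x = torch.randn(2, 14 * 14, 192)             # 14x14 patch grid, illustrative dim
print(PEG(192)(x, h=14, w=14).shape)          # torch.Size([2, 196, 192])
```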
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
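Segatron's segment-aware mechanism, as summarized above, replaces the single token-index position embedding with paragraph, sentence, and token position information. A minimal sketch of that idea follows, assuming three learned tables combined by summation; the table sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 256
# Separate learned tables in a real model; random matrices here for the sketch.
para_emb = rng.normal(size=(32, d_model))    # paragraph index within the document
sent_emb = rng.normal(size=(128, d_model))   # sentence index within the paragraph
tok_emb = rng.normal(size=(512, d_model))    # token index within the sentence

def segment_aware_position(para_idx, sent_idx, tok_idx):
    """Segment-aware position embedding: sum of the three index embeddings."""
    return para_emb[para_idx] + sent_emb[sent_idx] + tok_emb[tok_idx]

# Example: 5 tokens spanning two sentences of the first paragraph.
pos = segment_aware_position(np.array([0, 0, 0, 0, 0]),
                             np.array([0, 0, 0, 1, 1]),
                             np.array([0, 1, 2, 0, 1]))
print(pos.shape)  # (5, 256)
```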
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.