Randomized Positional Encodings Boost Length Generalization of
Transformers
- URL: http://arxiv.org/abs/2305.16843v1
- Date: Fri, 26 May 2023 11:47:52 GMT
- Title: Randomized Positional Encodings Boost Length Generalization of
Transformers
- Authors: Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya,
Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness
- Abstract summary: Transformers have impressive generalization capabilities on tasks with a fixed context length.
They fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string.
We introduce a novel family of positional encodings that can overcome this problem.
- Score: 14.814408238614165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have impressive generalization capabilities on tasks with a
fixed context length. However, they fail to generalize to sequences of
arbitrary length, even for seemingly simple tasks such as duplicating a string.
Moreover, simply training on longer sequences is inefficient due to the
quadratic computation complexity of the global attention mechanism. In this
work, we demonstrate that this failure mode is linked to positional encodings
being out-of-distribution for longer sequences (even for relative encodings)
and introduce a novel family of positional encodings that can overcome this
problem. Concretely, our randomized positional encoding scheme simulates the
positions of longer sequences and randomly selects an ordered subset to fit the
sequence's length. Our large-scale empirical evaluation of 6000 models across
15 algorithmic reasoning tasks shows that our method allows Transformers to
generalize to sequences of unseen length (increasing test accuracy by 12.0% on
average).
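A minimal sketch of the core idea, assuming a NumPy-style implementation: for each training sequence, the scheme samples an ordered subset of positions from a much larger range (here a hypothetical max_len hyperparameter), so position values that would otherwise only appear on long test sequences are already seen during training. The function name and interface below are illustrative, not the paper's actual code; in the paper, the sampled positions feed into standard (e.g. sinusoidal or relative) positional encodings.
```python
import numpy as np

def randomized_positions(seq_len: int, max_len: int, rng: np.random.Generator) -> np.ndarray:
    """Sample an ordered subset of `seq_len` positions from the range [0, max_len).

    Illustrative sketch: a training sequence of length `seq_len` is assigned
    positions drawn from a much larger range, simulating where its tokens
    could sit inside a longer sequence.
    """
    assert seq_len <= max_len, "max_len must cover the longest sequence to be handled"
    positions = rng.choice(max_len, size=seq_len, replace=False)  # random subset, no repeats
    positions.sort()  # keep positions increasing so the tokens' relative order is preserved
    return positions

# Example: a length-8 training sequence receives positions sampled from [0, 2048).
rng = np.random.default_rng(0)
print(randomized_positions(8, 2048, rng))
```
At test time, a longer sequence (up to max_len tokens) draws its positions from the same range, so the encodings it receives are no longer out-of-distribution, which is the failure mode identified in the abstract.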
Related papers
- Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks [32.81985604969825]
We show that Transformers fail to generalize over length on basic arithmetic tasks such as addition and multiplication.
A major reason behind this failure is the vast difference in structure between numbers and text.
We propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings.
arXiv Detail & Related papers (2024-06-04T02:00:07Z)
- Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z)
- Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework that enables off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which regards the decoders of transformers as environments.
arXiv Detail & Related papers (2023-08-25T05:52:05Z)
- LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z)
- Hyena Hierarchy: Towards Larger Convolutional Language Models [115.82857881546089]
Hyena is a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating.
In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods.
arXiv Detail & Related papers (2023-02-21T18:29:25Z)
- A Non-monotonic Self-terminating Language Model [62.93465126911921]
In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm.
We first define an incomplete probable decoding algorithm which includes greedy search, top-k sampling, and nucleus sampling.
We then propose a non-monotonic self-terminating language model, which relaxes the constraint of monotonically increasing termination probability.
arXiv Detail & Related papers (2022-10-03T00:28:44Z)
- Sequence Length is a Domain: Length-based Overfitting in Transformer Models [0.0]
In machine translation, neural systems perform worse on very long sequences than the preceding phrase-based translation approaches did.
We show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
arXiv Detail & Related papers (2021-09-15T13:25:19Z)
- On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
- Consistency of a Recurrent Language Model With Respect to Incomplete Decoding [67.54760086239514]
We study the issue of receiving infinite-length sequences from a recurrent language model.
We propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model.
arXiv Detail & Related papers (2020-02-06T19:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.