Randomized Positional Encodings Boost Length Generalization of
Transformers
- URL: http://arxiv.org/abs/2305.16843v1
- Date: Fri, 26 May 2023 11:47:52 GMT
- Title: Randomized Positional Encodings Boost Length Generalization of
Transformers
- Authors: Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya,
Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness
- Abstract summary: Transformers have impressive generalization capabilities on tasks with a fixed context length.
They fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string.
We introduce a novel family of positional encodings that can overcome this problem.
- Score: 14.814408238614165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have impressive generalization capabilities on tasks with a
fixed context length. However, they fail to generalize to sequences of
arbitrary length, even for seemingly simple tasks such as duplicating a string.
Moreover, simply training on longer sequences is inefficient due to the
quadratic computation complexity of the global attention mechanism. In this
work, we demonstrate that this failure mode is linked to positional encodings
being out-of-distribution for longer sequences (even for relative encodings)
and introduce a novel family of positional encodings that can overcome this
problem. Concretely, our randomized positional encoding scheme simulates the
positions of longer sequences and randomly selects an ordered subset to fit the
sequence's length. Our large-scale empirical evaluation of 6000 models across
15 algorithmic reasoning tasks shows that our method allows Transformers to
generalize to sequences of unseen length (increasing test accuracy by 12.0% on
average).
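A minimal sketch of the core idea, assuming a NumPy-style implementation: for each training sequence, the scheme samples an ordered subset of positions from a much larger range (here a hypothetical max_len hyperparameter), so position values that would otherwise only appear on long test sequences are already seen during training. The function name and interface below are illustrative, not the paper's actual code; in the paper, the sampled positions feed into standard (e.g. sinusoidal or relative) positional encodings.
```python
import numpy as np

def randomized_positions(seq_len: int, max_len: int, rng: np.random.Generator) -> np.ndarray:
    """Sample an ordered subset of `seq_len` positions from the range [0, max_len).

    Illustrative sketch: a training sequence of length `seq_len` is assigned
    positions drawn from a much larger range, simulating where its tokens
    could sit inside a longer sequence.
    """
    assert seq_len <= max_len, "max_len must cover the longest sequence to be handled"
    positions = rng.choice(max_len, size=seq_len, replace=False)  # random subset, no repeats
    positions.sort()  # keep positions increasing so the tokens' relative order is preserved
    return positions

# Example: a length-8 training sequence receives positions sampled from [0, 2048).
rng = np.random.default_rng(0)
print(randomized_positions(8, 2048, rng))
```
At test time, a longer sequence (up to max_len tokens) draws its positions from the same range, so the encodings it receives are no longer out-of-distribution, which is the failure mode identified in the abstract.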
Related papers
- Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks [32.81985604969825]
We show that Transformers fail to generalize over length on basic arithmetic tasks such as addition and multiplication.
A major reason behind this failure is the vast difference in structure between numbers and text.
We propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings.
arXiv Detail & Related papers (2024-06-04T02:00:07Z)
- Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z)
- Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework that enables off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which regards the decoders of transformers as environments.
arXiv Detail & Related papers (2023-08-25T05:52:05Z)
- LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z)
- Hyena Hierarchy: Towards Larger Convolutional Language Models [115.82857881546089]
Hyena is a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating.
In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods.
arXiv Detail & Related papers (2023-02-21T18:29:25Z)
- A Non-monotonic Self-terminating Language Model [62.93465126911921]
In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm.
We first define an incomplete probable decoding algorithm which includes greedy search, top-k sampling, and nucleus sampling.
We then propose a non-monotonic self-terminating language model, which relaxes the constraint of monotonically increasing termination probability.
arXiv Detail & Related papers (2022-10-03T00:28:44Z)
- Sequence Length is a Domain: Length-based Overfitting in Transformer Models [0.0]
In machine translation, neural systems perform worse on very long sequences than the preceding phrase-based translation approaches did.
We show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
arXiv Detail & Related papers (2021-09-15T13:25:19Z)
- On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
- Consistency of a Recurrent Language Model With Respect to Incomplete Decoding [67.54760086239514]
We study the issue of receiving infinite-length sequences from a recurrent language model.
We propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model.
arXiv Detail & Related papers (2020-02-06T19:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.