Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- URL: http://arxiv.org/abs/2108.12409v1
- Date: Fri, 27 Aug 2021 17:35:06 GMT
- Title: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- Authors: Ofir Press, Noah A. Smith, Mike Lewis
- Abstract summary: We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation.
ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance.
We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048.
- Score: 62.51758040848735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the introduction of the transformer model by Vaswani et al. (2017), a
fundamental question remains open: how to achieve extrapolation at inference
time to longer sequences than seen during training? We first show that
extrapolation can be improved by changing the position representation method,
though we find that existing proposals do not allow efficient extrapolation. We
introduce a simple and efficient method, Attention with Linear Biases (ALiBi),
that allows for extrapolation. ALiBi does not add positional embeddings to the
word embeddings; instead, it biases the query-key attention scores with a term
that is proportional to their distance. We show that this method allows
training a 1.3 billion parameter model on input sequences of length 1024 that
extrapolates to input sequences of length 2048, achieving the same perplexity
as a sinusoidal position embedding model trained on inputs of length 2048, 11%
faster and using 11% less memory. ALiBi's inductive bias towards recency allows
it to outperform multiple strong position methods on the WikiText-103
benchmark. Finally, we provide analysis of ALiBi to understand why it leads to
better performance.
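To make the mechanism concrete, here is a minimal NumPy sketch of an ALiBi-style bias, assuming the head-specific slope schedule the paper describes for head counts that are powers of two (a geometric sequence starting at 2^(-8/n_heads)); the function names are illustrative rather than the authors' released code.

```python
# Minimal sketch of an ALiBi-style linear bias (assumed slope schedule and
# helper names; not the authors' implementation).
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """One slope per head: 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    start = 2.0 ** (-8.0 / n_heads)
    return start ** np.arange(1, n_heads + 1)

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Per-head (seq_len, seq_len) bias: slope * (j - i), a penalty that
    grows linearly with the distance between query i and key j."""
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]          # j - i, negative for past keys
    return alibi_slopes(n_heads)[:, None, None] * distance[None, :, :]

def attention_scores(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """q, k: (n_heads, seq_len, head_dim). No positional embeddings are added
    to the inputs; the bias goes directly onto the query-key scores."""
    n_heads, seq_len, head_dim = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores = scores + alibi_bias(seq_len, n_heads)
    # Causal mask: a query may only attend to itself and earlier keys.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, scores)
```

Because the penalty is a simple linear function of query-key distance, the same scoring routine applies unchanged to sequences longer than those seen during training, which is the property the extrapolation claim above rests on.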
Related papers
- On the Inductive Bias of Stacking Towards Improving Reasoning [50.225873619537765]
We propose a variant of gradual stacking called MIDAS that can speed up language model training by up to 40%.
MIDAS is not only training-efficient but surprisingly also has an inductive bias towards improving downstream tasks.
We conjecture the underlying reason for this inductive bias by exploring the connection of stacking to looped models.
arXiv Detail & Related papers (2024-09-27T17:58:21Z)
- Finding Transformer Circuits with Edge Pruning [71.12127707678961]
We propose Edge Pruning as an effective and scalable solution to automated circuit discovery.
Our method finds circuits in GPT-2 that use less than half the number of edges compared to circuits found by previous methods.
Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale that prior methods operate on.
arXiv Detail & Related papers (2024-06-24T16:40:54Z)
- Bayesian Online Natural Gradient (BONG) [9.800443064368467]
We propose a novel approach to sequential Bayesian inference based on variational Bayes (VB).
The key insight is that, in the online setting, we do not need to add the KL term to regularize to the prior.
We show empirically that our method outperforms other online VB methods in the non-conjugate setting.
arXiv Detail & Related papers (2024-05-30T04:27:36Z)
- Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
arXiv Detail & Related papers (2023-11-22T00:31:01Z)
- Improving Length-Generalization in Transformers via Task Hinting [42.95479331339189]
The performance of a transformer model trained on tasks up to a certain length drops sharply when applied to longer instances of the same problem.
This work proposes an approach based on task hinting towards addressing length generalization.
arXiv Detail & Related papers (2023-10-01T16:57:40Z)
- Memory-efficient Transformers via Top-$k$ Attention [23.672065688109395]
In this work, we propose a simple yet highly accurate approximation for vanilla attention.
We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys (a minimal sketch of this idea appears after this list).
We show our approach leads to accuracy that is nearly-identical to vanilla attention in multiple setups including training from scratch, fine-tuning, and zero-shot inference.
arXiv Detail & Related papers (2021-06-13T02:30:23Z)
- IRLI: Iterative Re-partitioning for Learning to Index [104.72641345738425]
Methods have to trade off between obtaining high accuracy and maintaining load balance and scalability in distributed settings.
We propose a novel approach called IRLI, which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data.
We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing.
arXiv Detail & Related papers (2021-03-17T23:13:25Z)
- Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, reduces overall training time.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
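The Top-$k$ Attention entry above describes a concrete approximation: process queries in chunks and, for each query, keep only the $k$ largest query-key scores before the softmax. The following is a minimal sketch under that reading; the function name, chunk size, and tie handling are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of chunked top-k attention as summarized above. All names
# (topk_attention, chunk_size) are illustrative, not the paper's API.
import numpy as np

def topk_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                   top_k: int, chunk_size: int = 128) -> np.ndarray:
    """q: (Lq, d), k: (Lk, d), v: (Lk, dv) -> output of shape (Lq, dv)."""
    d = q.shape[-1]
    top_k = min(top_k, k.shape[0])
    outputs = []
    for start in range(0, q.shape[0], chunk_size):
        q_chunk = q[start:start + chunk_size]                  # (c, d)
        scores = q_chunk @ k.T / np.sqrt(d)                    # (c, Lk)
        # Keep only each query's top-k scores; everything else is pushed to
        # -inf so it receives zero weight after the softmax.
        kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]
        masked = np.where(scores >= kth, scores, -np.inf)
        weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v)                            # (c, dv)
    return np.concatenate(outputs, axis=0)
```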