A Length-Extrapolatable Transformer
- URL: http://arxiv.org/abs/2212.10554v1
- Date: Tue, 20 Dec 2022 18:56:20 GMT
- Title: A Length-Extrapolatable Transformer
- Authors: Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon
Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei
- Abstract summary: We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
- Score: 98.54835576985664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Position modeling plays a critical role in Transformers. In this paper, we
focus on length extrapolation, i.e., training on short texts while evaluating
longer sequences. We define attention resolution as an indicator of
extrapolation. Then we propose two designs to improve the above metric of
Transformers. Specifically, we introduce a relative position embedding to
explicitly maximize attention resolution. Moreover, we use blockwise causal
attention during inference for better resolution. We evaluate different
Transformer variants with language modeling. Experimental results show that our
model achieves strong performance in both interpolation and extrapolation
settings. The code will be available at https://aka.ms/LeX-Transformer.
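As a rough illustration of the blockwise causal attention used at inference time, the sketch below builds an attention mask in which every position attends causally within its own block and to the whole preceding block, keeping the attended context close to the training length. The function name, the choice of block size, and the exact windowing here are illustrative assumptions, not the paper's reference implementation.
```python
import numpy as np

def blockwise_causal_mask(seq_len, block_size):
    """Boolean attention mask (True = may attend). Each query attends
    causally inside its own block and to every position of the block
    immediately before it, so the attended context stays close to the
    training length even when seq_len is much longer."""
    idx = np.arange(seq_len)
    block = idx // block_size
    causal = idx[:, None] >= idx[None, :]        # never attend to the future
    gap = block[:, None] - block[None, :]        # query block minus key block
    return causal & (gap <= 1)                   # own block + previous block

# Example: a model trained on length 8 run on a length-24 input,
# using the training length as the block size.
mask = blockwise_causal_mask(24, block_size=8)
```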
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise (a toy sketch of this differential attention appears after this list).
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- LeaPformer: Enabling Linear Transformers for Autoregressive and Simultaneous Tasks via Learned Proportions [7.478336691707095]
State-of-the-art re-weighting functions rely heavily on target sequence lengths.
We propose Learned Proportions (LeaP) and LeaPformers.
arXiv Detail & Related papers (2024-05-18T22:23:07Z)
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions (see the inverted-attention sketch after this list).
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts (a toy sketch of the general idea appears after this list).
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
- When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants [39.00433193973159]
We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision.
We identify input length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models.
To conduct this analysis for speech, we introduce L-HuBERT, a novel local-attention variant of a self-supervised speech model.
arXiv Detail & Related papers (2023-06-14T17:59:02Z)
- Transformer-F: A Transformer network with effective methods for learning universal sentence representation [8.225067988604351]
The Transformer model is widely used in natural language processing for sentence representation.
In this paper, two approaches are introduced to improve the performance of Transformers.
arXiv Detail & Related papers (2021-07-02T03:20:11Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); a sketch of the Toeplitz matrix-vector product via FFT appears after this list.
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
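The Differential Transformer entry above describes amplifying attention to relevant context while canceling noise. A minimal single-head sketch of that differential-attention idea follows: two softmax attention maps are computed from separate query/key projections and subtracted, so attention mass shared by both maps cancels. The weight matrices, the fixed lam value, and the function name are illustrative; the paper learns the subtraction weight and adds per-head normalization, which are omitted here.
```python
import numpy as np
from scipy.special import softmax

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Single-head sketch: subtract one softmax attention map from another
    so that attention mass shared by both maps (noise) cancels out."""
    d = Wq1.shape[1]
    q1, k1 = x @ Wq1, x @ Wk1
    q2, k2 = x @ Wq2, x @ Wk2
    v = x @ Wv
    a1 = softmax(q1 @ k1.T / np.sqrt(d), axis=-1)
    a2 = softmax(q2 @ k2.T / np.sqrt(d), axis=-1)
    return (a1 - lam * a2) @ v   # lam is learned in the paper; fixed here

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 32))                          # 10 tokens, width 32
Wq1, Wk1, Wq2, Wk2, Wv = (rng.standard_normal((32, 16)) for _ in range(5))
out = differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv)    # (10, 16)
```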
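The iTransformer entry describes applying attention and the feed-forward network on inverted dimensions: each variate's whole time series is embedded as a single token, so attention mixes information across variates instead of across time steps. The sketch below shows only that inverted attention step, with illustrative shapes and names; the full model also applies layer normalization, a per-token feed-forward network, and a projection to the forecast horizon.
```python
import numpy as np
from scipy.special import softmax

def inverted_attention(x, W_embed, Wq, Wk, Wv):
    """x has shape (T, N): T time steps, N variates. Each variate's full
    series becomes one token, so the (N, N) attention matrix captures
    variate-to-variate interactions rather than step-to-step ones."""
    tokens = x.T @ W_embed                          # (N, D) variate tokens
    d = Wq.shape[1]
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (N, N) weights
    return attn @ v                                 # (N, D) updated tokens

rng = np.random.default_rng(0)
x = rng.standard_normal((96, 7))                    # 96 steps, 7 variates
W_embed = rng.standard_normal((96, 16))             # series length -> width 16
Wq, Wk, Wv = (rng.standard_normal((16, 16)) for _ in range(3))
out = inverted_attention(x, W_embed, Wq, Wk, Wv)    # (7, 16)
```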
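The FIRE entry describes a functional relative position encoding: a small learned function of a progressively normalized relative distance produces an additive attention bias. The toy sketch below only conveys that general shape; the distance transform, the normalization, and the stand-in bias_fn are illustrative simplifications, not FIRE's actual parameterization.
```python
import numpy as np

def functional_position_bias(seq_len, bias_fn):
    """Toy functional relative position bias: map the causal relative
    distance, normalized by the query position, through a learned
    function that returns one additive bias per (query, key) pair."""
    i = np.arange(seq_len, dtype=float)[:, None]   # query positions
    j = np.arange(seq_len, dtype=float)[None, :]   # key positions
    rel = np.maximum(i - j, 0.0)                   # causal relative distance
    norm = np.maximum(i, 1.0)                      # progressive normalization
    return bias_fn(rel / norm)                     # (seq_len, seq_len) bias

# Stand-in for a learned function: a fixed monotone map for illustration.
bias = functional_position_bias(8, lambda z: -2.0 * z)
```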
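The kernelized-attention entry hinges on the fact that a relative positional term depends only on the offset i - j, so the corresponding matrix is Toeplitz and can be applied in O(n log n) with the FFT. Below is a self-contained sketch of that primitive, a Toeplitz matrix-vector product via circulant embedding; the function name and interface are illustrative, and the sketch shows the linear-algebra trick rather than the paper's full kernelized attention.
```python
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """Multiply the n x n Toeplitz matrix T (T[i, j] = first_col[i - j] for
    i >= j, first_row[j - i] for j > i) by x in O(n log n) by embedding T
    in a 2n x 2n circulant matrix and applying it with the FFT."""
    n = len(x)
    col = np.concatenate([first_col, [0.0], first_row[-1:0:-1]])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

# Quick check against an explicitly constructed Toeplitz matrix.
rng = np.random.default_rng(0)
n = 4
c = rng.standard_normal(n)
r = np.concatenate([[c[0]], rng.standard_normal(n - 1)])
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
x = rng.standard_normal(n)
assert np.allclose(T @ x, toeplitz_matvec(c, r, x))
```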