Dissecting Transformer Length Extrapolation via the Lens of Receptive
Field Analysis
- URL: http://arxiv.org/abs/2212.10356v2
- Date: Tue, 23 May 2023 21:18:09 GMT
- Title: Dissecting Transformer Length Extrapolation via the Lens of Receptive
Field Analysis
- Authors: Ta-Chung Chi and Ting-Han Fan and Alexander I. Rudnicky and Peter J.
Ramadge
- Abstract summary: We dissect a relative positional embedding design, ALiBi, via the lens of receptive field analysis.
We modify the vanilla Sinusoidal positional embedding to create Sandwich, the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence.
- Score: 72.71398034617607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Length extrapolation permits training a transformer language model on short
sequences while preserving perplexities when the model is tested on substantially longer
sequences. A relative positional embedding design, ALiBi, has had the widest
usage to date. We dissect ALiBi via the lens of receptive field analysis
empowered by a novel cumulative normalized gradient tool. The concept of
receptive field further allows us to modify the vanilla Sinusoidal positional
embedding to create ~\textbf{Sandwich}, the first parameter-free relative
positional embedding design that truly length information uses longer than the
training sequence. Sandwich shares with KERPLE and T5 the same logarithmic
decaying temporal bias pattern with learnable relative positional embeddings;
these findings elucidate future extrapolatable positional embedding designs.
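For orientation, here is a minimal NumPy sketch of the two bias patterns the abstract contrasts: ALiBi's head-specific linear penalty on relative distance, and a Sandwich-style bias obtained from inner products of vanilla sinusoidal position embeddings, which depend only on the relative offset and decay with distance. The function names, slope schedule, and dimensions are illustrative assumptions, not the paper's reference implementation; the actual Sandwich design may scale or truncate these inner products differently.

```python
import numpy as np

def alibi_bias(seq_len, num_heads=8):
    """ALiBi-style bias: each head subtracts a head-specific slope times the
    relative distance from its attention logits (causal setting).  Slopes here
    follow the 2^-1, 2^-2, ... pattern used when num_heads is a power of two."""
    slopes = np.array([2.0 ** -(i + 1) for i in range(num_heads)])
    pos = np.arange(seq_len)
    dist = np.minimum(pos[None, :] - pos[:, None], 0)   # j - i, clipped to the past
    return slopes[:, None, None] * dist[None]            # (num_heads, seq, seq)

def sandwich_like_bias(seq_len, d_model=128):
    """Sandwich-flavoured bias (sketch): inner products of vanilla sinusoidal
    position embeddings.  p_i . p_j = sum_k cos((i - j) * w_k) depends only on
    the offset i - j and decays with distance, giving the log-like temporal
    bias pattern the abstract describes."""
    pos = np.arange(seq_len)[:, None]
    freq = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angle = pos * freq[None, :]
    pe = np.concatenate([np.sin(angle), np.cos(angle)], axis=-1)  # (seq, d_model)
    return pe @ pe.T                                              # (seq, seq)

if __name__ == "__main__":
    print(alibi_bias(6, num_heads=2)[0])   # head 0: penalty grows 0.5 per token into the past
    print(sandwich_like_bias(6)[0])        # row 0: bias decays toward more distant tokens
```

Adding either matrix to the pre-softmax attention logits biases each query toward nearby keys, which is the receptive-field-limiting behaviour the paper analyzes.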
Related papers
- Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding [40.289596031245374]
All Transformer-based models including large language models (LLMs) suffer from a preset length limit.
Numerous methods have emerged to enhance the length extrapolation of Transformers.
This survey aims to enable the reader to gain a deep understanding of existing methods and provide stimuli for future research.
arXiv Detail & Related papers (2023-12-28T14:42:24Z)
- Attention Alignment and Flexible Positional Embeddings Improve
Transformer Length Extrapolation [61.305218287797025]
An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning.
We find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns.
We propose two attention alignment strategies, based on temperature scaling, to alleviate attention misalignment at extrapolated lengths.
arXiv Detail & Related papers (2023-11-01T17:43:35Z)
- Position Interpolation Improves ALiBi Extrapolation [2.1454660086411796]
We propose using linear position interpolation to extend the extrapolation range of models using Attention with Linear Biases (ALiBi).
We find that position interpolation significantly improves extrapolation capability on upstream language modelling and on downstream summarization and retrieval tasks (see the sketch after this list).
arXiv Detail & Related papers (2023-10-18T16:41:47Z)
- Latent Positional Information is in the Self-Attention Variance of
Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
- Causal Transformer for Estimating Counterfactual Outcomes [18.640006398066188]
Estimating counterfactual outcomes over time from observational data is relevant for many applications.
We develop a novel Causal Transformer for estimating counterfactual outcomes over time.
Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders.
arXiv Detail & Related papers (2022-04-14T22:40:09Z)
- Sketching as a Tool for Understanding and Accelerating Self-attention
for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Prior methods such as Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively.
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
- Deriving Differential Target Propagation from Iterating Approximate
Inverses [91.3755431537592]
We show that a differential form of target propagation, relying on learned inverses of each layer, gives rise to an update rule that corresponds to approximate Gauss-Newton gradient-based optimization.
We consider several iterative calculations based on local auto-encoders at each layer in order to achieve more precise inversions for more accurate target propagation.
arXiv Detail & Related papers (2020-07-29T22:34:45Z)
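To make the position-interpolation entry above concrete (the "see the sketch after this list" note), here is a small, hedged extension of the ALiBi bias from the earlier sketch: relative distances at evaluation time are rescaled by train_len / eval_len before the linear penalty is applied, so the bias magnitudes stay near the range seen during training. The function name and the specific rescaling are illustrative assumptions; consult the cited paper for its exact formulation.

```python
import numpy as np

def interpolated_alibi_bias(eval_len, train_len, num_heads=8):
    """Sketch of linear position interpolation applied to ALiBi: rescale the
    causal relative distances by train_len / eval_len so that, when evaluating
    beyond the training length, the linear penalties stay within the range the
    model saw during training.  Illustration only; the cited paper's exact
    scaling rule may differ."""
    slopes = np.array([2.0 ** -(i + 1) for i in range(num_heads)])
    pos = np.arange(eval_len)
    dist = np.minimum(pos[None, :] - pos[:, None], 0)    # causal relative offsets
    scale = train_len / max(eval_len, train_len)           # <= 1 once we extrapolate
    return slopes[:, None, None] * (dist[None] * scale)

if __name__ == "__main__":
    # Trained at length 8, evaluated at length 16: the strongest penalty is
    # comparable to the strongest penalty seen during training.
    bias = interpolated_alibi_bias(eval_len=16, train_len=8, num_heads=2)
    print(bias[0, -1, 0])   # head 0, last query attending to the first key
```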