Dissecting Transformer Length Extrapolation via the Lens of Receptive
Field Analysis
- URL: http://arxiv.org/abs/2212.10356v2
- Date: Tue, 23 May 2023 21:18:09 GMT
- Title: Dissecting Transformer Length Extrapolation via the Lens of Receptive
Field Analysis
- Authors: Ta-Chung Chi and Ting-Han Fan and Alexander I. Rudnicky and Peter J.
Ramadge
- Abstract summary: We dissect a relative positional embedding design, ALiBi, via the lens of receptive field analysis.
We modify the vanilla Sinusoidal positional embedding to create Sandwich, the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence.
- Score: 72.71398034617607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Length extrapolation permits training a transformer language model on short
sequences while preserving perplexities when the model is tested on substantially longer
sequences. A relative positional embedding design, ALiBi, has had the widest
usage to date. We dissect ALiBi via the lens of receptive field analysis
empowered by a novel cumulative normalized gradient tool. The concept of
receptive field further allows us to modify the vanilla Sinusoidal positional
embedding to create ~\textbf{Sandwich}, the first parameter-free relative
positional embedding design that truly length information uses longer than the
training sequence. Sandwich shares with KERPLE and T5 the same logarithmic
decaying temporal bias pattern with learnable relative positional embeddings;
these findings elucidate future extrapolatable positional embedding designs.
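For orientation, here is a minimal NumPy sketch of the two bias patterns the abstract contrasts: ALiBi's head-specific linear penalty on relative distance, and a Sandwich-style bias obtained from inner products of vanilla sinusoidal position embeddings, which depend only on the relative offset and decay with distance. The function names, slope schedule, and dimensions are illustrative assumptions, not the paper's reference implementation; the actual Sandwich design may scale or truncate these inner products differently.

```python
import numpy as np

def alibi_bias(seq_len, num_heads=8):
    """ALiBi-style bias: each head subtracts a head-specific slope times the
    relative distance from its attention logits (causal setting).  Slopes here
    follow the 2^-1, 2^-2, ... pattern used when num_heads is a power of two."""
    slopes = np.array([2.0 ** -(i + 1) for i in range(num_heads)])
    pos = np.arange(seq_len)
    dist = np.minimum(pos[None, :] - pos[:, None], 0)   # j - i, clipped to the past
    return slopes[:, None, None] * dist[None]            # (num_heads, seq, seq)

def sandwich_like_bias(seq_len, d_model=128):
    """Sandwich-flavoured bias (sketch): inner products of vanilla sinusoidal
    position embeddings.  p_i . p_j = sum_k cos((i - j) * w_k) depends only on
    the offset i - j and decays with distance, giving the log-like temporal
    bias pattern the abstract describes."""
    pos = np.arange(seq_len)[:, None]
    freq = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angle = pos * freq[None, :]
    pe = np.concatenate([np.sin(angle), np.cos(angle)], axis=-1)  # (seq, d_model)
    return pe @ pe.T                                              # (seq, seq)

if __name__ == "__main__":
    print(alibi_bias(6, num_heads=2)[0])   # head 0: penalty grows 0.5 per token into the past
    print(sandwich_like_bias(6)[0])        # row 0: bias decays toward more distant tokens
```

Adding either matrix to the pre-softmax attention logits biases each query toward nearby keys, which is the receptive-field-limiting behaviour the paper analyzes.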
Related papers
- Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding [40.289596031245374]
All Transformer-based models including large language models (LLMs) suffer from a preset length limit.
Numerous methods have emerged to enhance the length extrapolation of Transformers.
This survey aims to enable the reader to gain a deep understanding of existing methods and provide stimuli for future research.
arXiv Detail & Related papers (2023-12-28T14:42:24Z)
- Attention Alignment and Flexible Positional Embeddings Improve
Transformer Length Extrapolation [61.305218287797025]
An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning.
We find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns.
We propose two attention alignment strategies, based on temperature scaling, to alleviate attention misalignment at extrapolated lengths.
arXiv Detail & Related papers (2023-11-01T17:43:35Z)
- Position Interpolation Improves ALiBi Extrapolation [2.1454660086411796]
We propose using linear position interpolation to extend the extrapolation range of models using Attention with Linear Biases (ALiBi).
We find that position interpolation significantly improves extrapolation capability on upstream language modelling and on downstream summarization and retrieval tasks (see the sketch after this list).
arXiv Detail & Related papers (2023-10-18T16:41:47Z)
- Latent Positional Information is in the Self-Attention Variance of
Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
- Causal Transformer for Estimating Counterfactual Outcomes [18.640006398066188]
Estimating counterfactual outcomes over time from observational data is relevant for many applications.
We develop a novel Causal Transformer for estimating counterfactual outcomes over time.
Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders.
arXiv Detail & Related papers (2022-04-14T22:40:09Z)
- Sketching as a Tool for Understanding and Accelerating Self-attention
for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Prior methods such as Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively.
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
- Deriving Differential Target Propagation from Iterating Approximate
Inverses [91.3755431537592]
We show that a differential form of target propagation, relying on learned inverses of each layer, gives rise to an update rule that corresponds to approximate Gauss-Newton gradient-based optimization.
We consider several iterative calculations based on local auto-encoders at each layer in order to achieve more precise inversions for more accurate target propagation.
arXiv Detail & Related papers (2020-07-29T22:34:45Z)
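To make the position-interpolation entry above concrete (the "see the sketch after this list" note), here is a small, hedged extension of the ALiBi bias from the earlier sketch: relative distances at evaluation time are rescaled by train_len / eval_len before the linear penalty is applied, so the bias magnitudes stay near the range seen during training. The function name and the specific rescaling are illustrative assumptions; consult the cited paper for its exact formulation.

```python
import numpy as np

def interpolated_alibi_bias(eval_len, train_len, num_heads=8):
    """Sketch of linear position interpolation applied to ALiBi: rescale the
    causal relative distances by train_len / eval_len so that, when evaluating
    beyond the training length, the linear penalties stay within the range the
    model saw during training.  Illustration only; the cited paper's exact
    scaling rule may differ."""
    slopes = np.array([2.0 ** -(i + 1) for i in range(num_heads)])
    pos = np.arange(eval_len)
    dist = np.minimum(pos[None, :] - pos[:, None], 0)    # causal relative offsets
    scale = train_len / max(eval_len, train_len)           # <= 1 once we extrapolate
    return slopes[:, None, None] * (dist[None] * scale)

if __name__ == "__main__":
    # Trained at length 8, evaluated at length 16: the strongest penalty is
    # comparable to the strongest penalty seen during training.
    bias = interpolated_alibi_bias(eval_len=16, train_len=8, num_heads=2)
    print(bias[0, -1, 0])   # head 0, last query attending to the first key
```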