Position Interpolation Improves ALiBi Extrapolation
- URL: http://arxiv.org/abs/2310.13017v1
- Date: Wed, 18 Oct 2023 16:41:47 GMT
- Title: Position Interpolation Improves ALiBi Extrapolation
- Authors: Faisal Al-Khateeb, Nolan Dey, Daria Soboleva, Joel Hestness
- Abstract summary: We propose using linear position interpolation to extend the extrapolation range of models using Attention with Linear Biases (ALiBi).
We find position interpolation significantly improves extrapolation capability on upstream language modelling and downstream summarization and retrieval tasks.
- Score: 2.1454660086411796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear position interpolation helps pre-trained models using rotary position
embeddings (RoPE) to extrapolate to longer sequence lengths. We propose using
linear position interpolation to extend the extrapolation range of models using
Attention with Linear Biases (ALiBi). We find position interpolation
significantly improves extrapolation capability on upstream language modelling
and downstream summarization and retrieval tasks.
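As a rough illustration of the idea in the abstract, the sketch below builds ALiBi's additive attention biases and applies linear position interpolation by rescaling the relative distances by train_len / seq_len when the evaluation length exceeds the training length. The function names, the slope formula, and the exact rescaling recipe are assumptions for illustration, not the authors' reference implementation.

```python
import torch

def alibi_slopes(n_heads):
    # Standard ALiBi head slopes: a geometric sequence starting at
    # 2^(-8 / n_heads) (exact for power-of-two head counts).
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(seq_len, n_heads, train_len=None):
    # dist[i, j] = i - j, the distance from query position i back to key j.
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).float()
    # Assumed interpolation recipe: shrink distances so the largest distance
    # at evaluation time matches the range seen during training.
    if train_len is not None and seq_len > train_len:
        dist = dist * (train_len / seq_len)
    # Head-specific linear penalties; the causal mask is applied elsewhere.
    return -alibi_slopes(n_heads)[:, None, None] * dist  # (n_heads, L, L)

# Toy example: a model trained with 512-token ALiBi evaluated at 1024 tokens.
bias = alibi_bias(seq_len=1024, n_heads=8, train_len=512)
```

With this rescaling, the penalty applied to the most distant token at 1024 positions roughly matches the penalty the model saw for the most distant token during 512-token training.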
Related papers
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
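As a generic picture of the iterate averaging this entry refers to, the sketch below performs a linear-interpolation (lookahead-style) update x_{t+1} = (1 - alpha) * x_t + alpha * T(x_t); the inner step and the coefficient are illustrative assumptions, not the paper's exact algorithm.

```python
def interpolated_step(params, inner_step, alpha=0.5):
    # One linear-interpolation update: move a fraction alpha of the way
    # toward the result of an inner optimizer step T (any callable).
    candidate = inner_step(params)
    return [(1 - alpha) * p + alpha * c for p, c in zip(params, candidate)]

# Toy usage: a gradient step on f(x) = x**2 as the inner operator.
grad_step = lambda xs: [x - 0.1 * (2 * x) for x in xs]
print(interpolated_step([1.0, -2.0], grad_step, alpha=0.5))
```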
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- Extending Context Window of Large Language Models via Positional Interpolation [26.076599895589098]
We present Position Interpolation, which extends the context window sizes of RoPE-based pretrained LLMs to up to 32768 tokens with minimal fine-tuning (within 1000 steps).
We demonstrate strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization from LLaMA 7B to 65B.
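The positional rescaling described in this entry can be pictured with a small hedged sketch: position indices are multiplied by train_len / target_len before the rotary angles are computed, so an extended context is mapped back into the position range seen during pre-training. The function and the toy lengths below are illustrative assumptions, not the paper's code.

```python
import torch

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    # Rotary-embedding angles; scale < 1 implements linear position
    # interpolation by compressing extended positions into the trained range.
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    return torch.outer(positions.float() * scale, inv_freq)  # (L, head_dim/2)

# Toy example: a model pre-trained on 2048 positions evaluated at 8192,
# so every position index is scaled by 2048 / 8192 = 0.25.
angles = rope_angles(torch.arange(8192), head_dim=128, scale=2048 / 8192)
```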
arXiv Detail & Related papers (2023-06-27T16:26:26Z)
- Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis [72.71398034617607]
We dissect a relative positional embedding design, ALiBi, via the lens of receptive field analysis.
We modify the vanilla Sinusoidal positional embedding to create Sandwich, the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence.
arXiv Detail & Related papers (2022-12-20T15:40:17Z)
- Benign overfitting and adaptive nonparametric regression [71.70323672531606]
We construct an estimator which is a continuous function interpolating the data points with high probability.
We attain minimax optimal rates under mean squared risk on the scale of Hölder classes adaptively to the unknown smoothness.
arXiv Detail & Related papers (2022-06-27T14:50:14Z)
- KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation [72.71398034617607]
KERPLE is a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences.
The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way.
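As one concrete instance of such a kernelized bias, here is a hedged sketch of a logarithmic CPD-kernel variant of the kind KERPLE derives, with two positive parameters per head that would normally be learned; the exact parameterization is assumed from the abstract, not taken from the paper's released code.

```python
import torch

def kerple_log_bias(seq_len, r1, r2):
    # Assumed logarithmic kernelized bias: -r1 * log(1 + r2 * |i - j|),
    # with r1, r2 > 0 (per attention head, learned in practice).
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).abs().float()
    return -r1 * torch.log1p(r2 * dist)

# Toy usage with fixed (untrained) parameters.
bias = kerple_log_bias(seq_len=128, r1=1.0, r2=0.5)
```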
arXiv Detail & Related papers (2022-05-20T01:25:57Z)
- On Optimal Interpolation In Linear Regression [22.310861786709538]
We show that the optimal way to interpolate in linear regression is to use functions that are linear in the response variable.
We identify a regime where the minimum-norm interpolator provably generalizes arbitrarily worse than the optimal response-linear achievable interpolator.
We extend the notion of optimal response-linear interpolation to random features regression under a linear data-generating model.
arXiv Detail & Related papers (2021-10-21T16:37:10Z)
- Compressing Deep ODE-Nets using Basis Function Expansions [105.05435207079759]
We consider formulations of the weights as continuous-depth functions using linear combinations of basis functions.
This perspective allows us to compress the weights through a change of basis, without retraining, while maintaining near state-of-the-art performance.
In turn, both inference time and the memory footprint are reduced, enabling quick and rigorous adaptation between computational environments.
arXiv Detail & Related papers (2021-06-21T03:04:51Z)
- Anti-Aliasing Add-On for Deep Prior Seismic Data Interpolation [20.336981948463702]
We propose to improve Deep Prior inversion by adding a directional Laplacian as regularization term to the problem.
We show that our results are less prone to aliasing also in presence of noisy and corrupted data.
arXiv Detail & Related papers (2021-01-27T12:46:58Z)
- Augmented Parallel-Pyramid Net for Attention Guided Pose-Estimation [90.28365183660438]
This paper proposes an augmented parallel-pyramid net with attention partial module and differentiable auto-data augmentation.
We define a new pose search space where the sequences of data augmentations are formulated as a trainable and operational CNN component.
Notably, our method achieves the top-1 accuracy on the challenging COCO keypoint benchmark and the state-of-the-art results on the MPII datasets.
arXiv Detail & Related papers (2020-03-17T03:52:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.