KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation
- URL: http://arxiv.org/abs/2205.09921v1
- Date: Fri, 20 May 2022 01:25:57 GMT
- Title: KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation
- Authors: Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky
- Abstract summary: KERPLE is a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences.
The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way.
- Score: 72.71398034617607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relative positional embeddings (RPE) have received considerable attention
since RPEs effectively model the relative distance among tokens and enable
length extrapolation. We propose KERPLE, a framework that generalizes relative
position embedding for extrapolation by kernelizing positional differences. We
achieve this goal using conditionally positive definite (CPD) kernels, a class
of functions known for generalizing distance metrics. To maintain the inner
product interpretation of self-attention, we show that a CPD kernel can be
transformed into a PD kernel by adding a constant offset. This offset is
implicitly absorbed in the Softmax normalization during self-attention. The
diversity of CPD kernels allows us to derive various RPEs that enable length
extrapolation in a principled way. Experiments demonstrate that the logarithmic
variant achieves excellent extrapolation performance on three large language
modeling datasets.
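To make the mechanism concrete, below is a minimal sketch (not the authors' released code) of how the logarithmic variant can enter self-attention: a bias of the form -r1 * log(1 + r2 * |m - n|), with r1, r2 > 0 learnable per head in the paper, is added to the attention logits, and the constant offset that turns the CPD kernel into a PD one cancels in the row-wise softmax. The toy shapes and the fixed r1, r2 values here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row max leaves the result unchanged: softmax is
    # invariant to a constant shift, which is why the CPD-to-PD offset
    # is absorbed "for free" during normalization.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kerple_log_bias(seq_len, r1=1.0, r2=1.0):
    """Logarithmic-variant bias b[m, n] = -r1 * log(1 + r2 * |m - n|)."""
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])
    return -r1 * np.log1p(r2 * dist)

def attention_with_kerple(q, k, v, r1=1.0, r2=1.0):
    """Single-head self-attention with the log-variant bias on the logits."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + kerple_log_bias(q.shape[0], r1, r2)
    return softmax(logits) @ v

# Toy usage with assumed shapes: 8 tokens, head dimension 16.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 16)) for _ in range(3))
out = attention_with_kerple(q, k, v, r1=0.5, r2=1.0)
print(out.shape)  # (8, 16)
```

Because the bias depends only on the distance |m - n|, the same learned parameters apply at any sequence length, which is what enables extrapolation beyond the training length.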
Related papers
- MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation [5.298814565953444]
Existing relative position encoding methods address the length extrapolation challenge using only a single kernel function.
This study proposes a novel relative positional encoding method, called MEP, which employs a weighted average to combine distinct kernel functions; a toy sketch of such a combination appears after this list.
We present two distinct versions of our method: a parameter-free variant that requires no new learnable parameters, and a parameterized variant capable of integrating state-of-the-art techniques.
arXiv Detail & Related papers (2024-03-26T13:38:06Z)
- Kernel Random Projection Depth for Outlier Detection [0.0]
This paper proposes an extension of Random Projection Depth (RPD) to cope with multiple modalities and non-convexity on data clouds.
In the proposed method, the RPD is computed in the framework of a reproducing kernel Hilbert space.
arXiv Detail & Related papers (2023-06-12T12:05:54Z)
- Revisiting Memory Efficient Kernel Approximation: An Indefinite Learning Perspective [0.8594140167290097]
Matrix approximations are a key element in large-scale machine learning approaches.
We extend MEKA to be applicable not only to shift-invariant kernels but also to non-stationary kernels.
We present a Lanczos-based estimation of a spectrum shift to develop a stable positive semi-definite MEKA approximation.
arXiv Detail & Related papers (2021-12-18T10:01:34Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated seamlessly with neural networks.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Deep Deterministic Uncertainty for Semantic Segmentation [97.89295891304394]
We extend Deep Deterministic Uncertainty (DDU) to semantic segmentation.
We show that DDU improves upon MC Dropout and Deep Ensembles while being significantly faster to compute.
arXiv Detail & Related papers (2021-10-29T20:45:58Z)
- Scalable Variational Gaussian Processes via Harmonic Kernel Decomposition [54.07797071198249]
We introduce a new scalable variational Gaussian process approximation which provides a high fidelity approximation while retaining general applicability.
We demonstrate that, on a range of regression and classification problems, our approach can exploit input space symmetries such as translations and reflections.
Notably, our approach achieves state-of-the-art results on CIFAR-10 among pure GP models.
arXiv Detail & Related papers (2021-06-10T18:17:57Z)
- Towards Unbiased Random Features with Lower Variance For Stationary Indefinite Kernels [26.57122949130266]
Our algorithm achieves lower variance and approximation error compared with existing kernel approximation methods.
With a better approximation to the originally selected kernels, improved classification accuracy and regression ability are obtained.
arXiv Detail & Related papers (2021-04-13T13:56:50Z)
- Fast Learning in Reproducing Kernel Krein Spaces via Signed Measures [31.986482149142503]
We cast this question as a distribution view by introducing the signed measure.
A series of non-PD kernels can be associated with the linear combination of specific finite Borel measures.
Specifically, this solution is also computationally implementable in practice to scale non-PD kernels in large sample cases.
arXiv Detail & Related papers (2020-05-30T12:10:35Z)
- SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)
- Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nyström method [76.73096213472897]
We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees.
Our approach leads to significantly better bounds for datasets with known rates of singular value decay.
We show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.
arXiv Detail & Related papers (2020-02-21T00:43:06Z)
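As referenced in the MEP entry above, here is a toy sketch of combining distinct kernelized relative-position biases by a weighted average. The specific kernel choices (KERPLE's logarithmic and power variants), the fixed weights, and all function names are illustrative assumptions, not MEP's exact configuration.

```python
import numpy as np

def kerple_log_bias(seq_len, r1=1.0, r2=1.0):
    """Logarithmic bias -r1 * log(1 + r2 * |m - n|), as in KERPLE."""
    d = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
    return -r1 * np.log1p(r2 * d)

def kerple_power_bias(seq_len, r1=1.0, p=1.0):
    """Power bias -r1 * |m - n| ** p (0 < p <= 2 keeps the kernel CPD)."""
    d = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
    return -r1 * d.astype(float) ** p

def combined_bias(seq_len, weights=(0.5, 0.5)):
    """Parameter-free style: fixed-weight average of two distinct biases."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the result stays a weighted average
    return w[0] * kerple_log_bias(seq_len) + w[1] * kerple_power_bias(seq_len)

print(combined_bias(6).shape)  # (6, 6)
```

A parameterized variant would make the weights learnable instead of fixed, which is the distinction the MEP summary draws between its two versions.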