Sparse Attention as Compact Kernel Regression
- URL: http://arxiv.org/abs/2601.22766v2
- Date: Wed, 04 Feb 2026 12:26:29 GMT
- Title: Sparse Attention as Compact Kernel Regression
- Authors: Saul Santos, Nuno Gonçalves, Daniel C. McNamee, Marcos Treviso, André F. T. Martins
- Abstract summary: A kernel-theoretic understanding of sparse attention mechanisms is currently missing. We establish a formal correspondence between sparse attention and compact (bounded support) kernels. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance.
- Score: 20.026224027434974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
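To make the correspondence concrete, here is a minimal NumPy sketch (not the authors' code; the bandwidth h and the toy 1-D data are illustrative assumptions) of test-time Nadaraya-Watson regression under a Gaussian kernel versus an Epanechnikov kernel. The Gaussian kernel assigns every key a nonzero weight, mirroring softmax attention, while the Epanechnikov kernel's bounded support zeroes weights out exactly, mirroring sparse attention.

```python
import numpy as np

def gaussian_kernel(u):
    # Unbounded support: every key receives a nonzero weight (softmax-like).
    return np.exp(-0.5 * u**2)

def epanechnikov_kernel(u):
    # Compact support: the weight is exactly zero for |u| > 1,
    # which is where the sparsity in the attention weights comes from.
    return np.maximum(0.0, 0.75 * (1.0 - u**2))

def nadaraya_watson(query, keys, values, kernel, h=1.0):
    """Test-time kernel regression: a weighted average of values, with
    weights given by a normalized kernel over query-key distances."""
    w = kernel((query - keys) / h)
    w = w / w.sum()            # fixed divide-by-sum normalization
    return w @ values, w

# Toy 1-D demo with made-up data.
rng = np.random.default_rng(0)
keys = np.linspace(-3, 3, 21)
values = np.sin(keys) + 0.1 * rng.standard_normal(keys.shape)

for kern in (gaussian_kernel, epanechnikov_kernel):
    pred, w = nadaraya_watson(0.3, keys, values, kern, h=1.0)
    print(f"{kern.__name__}: prediction={pred:.3f}, "
          f"nonzero weights={np.count_nonzero(w)}/{w.size}")
```

The divide-by-sum normalization used here matches the paper's fixed-normalization (normalized ReLU) case; per the abstract, sparsemax arises from the same Epanechnikov kernel under an adaptive normalization instead.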
Related papers
- Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models (a generic top-k sketch follows this entry).
arXiv Detail & Related papers (2025-06-17T01:19:28Z)
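The main paper positions compact kernels as a principled alternative to heuristic top-$k$ attention; for contrast, here is a minimal NumPy sketch of generic input-dependent top-k sparsity (the shapes and k_top value are illustrative assumptions, not the specific models studied in this entry).

```python
import numpy as np

def topk_sparse_attention(q, k, v, k_top=4):
    """Heuristic input-dependent sparsity: each query attends only to its
    k_top highest-scoring keys; all other attention weights are exactly zero."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                      # (n_q, n_k)
    idx = np.argpartition(scores, -k_top, axis=-1)[:, -k_top:]   # kept keys
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, idx,
                      np.take_along_axis(scores, idx, axis=-1), axis=-1)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the kept set
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)),
           rng.standard_normal((32, 16)),
           rng.standard_normal((32, 16)))
print(topk_sparse_attention(q, k, v).shape)  # (8, 16)
```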
- Spectral Truncation Kernels: Noncommutativity in $C^*$-algebraic Kernel Machines [12.11705128358537]
We propose a new class of positive definite kernels based on spectral truncation.
We show that the proposed kernels fill the gap between existing separable and commutative kernels.
The flexibility of the proposed class allows us to go beyond previous separable and commutative kernels.
arXiv Detail & Related papers (2024-05-28T04:47:12Z)
- Predicting Open-Hole Laminates Failure Using Support Vector Machines With Classical and Quantum Kernels [2.0039767863372506]
We show how to train surrogate models to learn the ultimate failure envelope of an open-hole composite plate under in-plane loading.
Using kernel-target alignment optimization, we tune the free parameters of all kernels to best separate safe and failure-inducing loading states (a minimal sketch of the alignment score follows this entry).
arXiv Detail & Related papers (2024-05-05T11:48:50Z)
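As a companion to the entry above, here is a minimal NumPy sketch of the kernel-target alignment score being optimized; the RBF kernel, synthetic two-class data, and gamma grid are illustrative stand-ins for the paper's kernels and safe/failure loading states.

```python
import numpy as np

def kernel_target_alignment(K, y):
    """Kernel-target alignment (Cristianini et al.): the cosine similarity
    between the Gram matrix K and the ideal target kernel y y^T.
    Higher alignment means the kernel separates the two classes better."""
    Y = np.outer(y, y)
    return (K * Y).sum() / (np.linalg.norm(K) * np.linalg.norm(Y))

# Toy illustration: tune an RBF bandwidth by maximizing the alignment.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.3, (20, 2)), rng.normal(+1, 0.3, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
for gamma in (0.01, 0.1, 1.0, 10.0):
    K = np.exp(-gamma * sq)
    print(f"gamma={gamma:5.2f}  alignment={kernel_target_alignment(K, y):.3f}")
```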
- Generalization in Kernel Regression Under Realistic Assumptions [41.345620270267446]
We provide rigorous bounds for common kernels under any amount of regularization or noise, any input dimension, and any number of samples.
Our results imply benign overfitting in high input dimensions, nearly tempered overfitting in fixed dimensions, and explicit convergence rates for regularized regression.
As a by-product, we obtain time-dependent bounds for neural networks trained in the kernel regime.
arXiv Detail & Related papers (2023-12-26T10:55:20Z)
- Kernel Subspace and Feature Extraction [7.424262881242935]
We study kernel methods in machine learning from the perspective of feature subspaces.
We construct a kernel from Hirschfeld-Gebelein-Rényi maximal correlation functions, which we call the maximal correlation kernel, and demonstrate its information-theoretic optimality.
arXiv Detail & Related papers (2023-01-04T02:46:11Z)
- Reconstructing Kernel-based Machine Learning Force Fields with Super-linear Convergence [0.18416014644193063]
We consider the broad class of Nyström-type methods for constructing preconditioners.
All considered methods aim to identify a representative subset of inducing (kernel) columns that approximates the dominant kernel spectrum (a generic Nyström sketch follows this entry).
arXiv Detail & Related papers (2022-12-24T13:45:50Z)
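The following is a minimal NumPy sketch of the generic Nyström approximation underlying such preconditioners; uniform column sampling and the RBF test matrix are illustrative assumptions, not the paper's column-selection strategy.

```python
import numpy as np

def nystrom_approx(K, idx):
    """Nystrom low-rank approximation of a PSD kernel matrix:
    K ~= C @ pinv(W) @ C.T, where C = K[:, idx] holds the inducing
    (kernel) columns and W is the corresponding square block."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

# Toy illustration: approximate an RBF Gram matrix from 20 inducing columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
idx = rng.choice(200, size=20, replace=False)  # simple uniform choice
K_hat = nystrom_approx(K, idx)
print(f"relative Frobenius error: {np.linalg.norm(K - K_hat) / np.linalg.norm(K):.3f}")
```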
- Meta-Learning Hypothesis Spaces for Sequential Decision-making [79.73213540203389]
We propose to meta-learn a kernel from offline data (Meta-KeL).
Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets.
We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.
arXiv Detail & Related papers (2022-02-01T17:46:51Z)
- Hybrid Random Features [60.116392415715275]
We propose a new class of random feature methods for linearizing softmax and Gaussian kernels, called hybrid random features (HRFs).
HRFs automatically adapt the quality of kernel estimation to provide the most accurate approximation in the defined regions of interest (a standard random-features sketch follows this entry).
arXiv Detail & Related papers (2021-10-08T20:22:59Z)
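For background, here is a minimal NumPy sketch of standard random Fourier features (Rahimi & Recht) for the Gaussian kernel, the baseline that hybrid constructions build on; it is not the HRF mechanism itself, and the gamma and feature count are illustrative.

```python
import numpy as np

def random_fourier_features(X, n_features=2048, gamma=0.5, seed=None):
    """Random Fourier features for the Gaussian kernel
    k(x, y) = exp(-gamma * ||x - y||^2), so that z(x) @ z(y) ~= k(x, y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2 * gamma), (d, n_features))  # spectral samples
    b = rng.uniform(0.0, 2 * np.pi, n_features)               # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Toy check: the feature map's inner products approximate the exact kernel.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
Z = random_fourier_features(X, seed=1)
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(f"max abs error: {np.abs(K_exact - Z @ Z.T).max():.3f}")
```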
- Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z)
- Scalable Variational Gaussian Processes via Harmonic Kernel Decomposition [54.07797071198249]
We introduce a new scalable variational Gaussian process approximation that provides high fidelity while retaining general applicability.
We demonstrate that, on a range of regression and classification problems, our approach can exploit input space symmetries such as translations and reflections.
Notably, our approach achieves state-of-the-art results on CIFAR-10 among pure GP models.
arXiv Detail & Related papers (2021-06-10T18:17:57Z)
- Flow-based Kernel Prior with Application to Blind Super-Resolution [143.21527713002354]
Kernel estimation is generally one of the key problems for blind image super-resolution (SR).
This paper proposes a normalizing flow-based kernel prior (FKP) for kernel modeling.
Experiments on synthetic and real-world images demonstrate that the proposed FKP can significantly improve the kernel estimation accuracy.
arXiv Detail & Related papers (2021-03-29T22:37:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.