Related papers: Data-Aware Random Feature Kernel for Transformers

Data-Aware Random Feature Kernel for Transformers

URL: http://arxiv.org/abs/2603.04127v1
Date: Wed, 04 Mar 2026 14:43:11 GMT
Title: Data-Aware Random Feature Kernel for Transformers
Authors: Amirhossein Farzam, Hossein Mobahi, Nolan Andrew Miller, Luke Sernau,
Abstract summary: In pretrained models, queries and keys are typically anisotropic. This induces high Monte Carlo variance in isotropic sampling schemes.<n>We introduce DARKFormer, a Data-Aware Random-feature Kernel transformer that features a data-aligned kernel geometry.<n>By combining random-feature efficiency with data-aware kernels, DARKFormer advances kernel-based attention in resource-constrained settings.
Score: 7.455696124254561
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel with positive random features drawn from an isotropic distribution. In pretrained models, however, queries and keys are typically anisotropic. This induces high Monte Carlo variance in isotropic sampling schemes unless one retrains the model or uses a large feature budget. Importance sampling can address this by adapting the sampling distribution to the input geometry, but complex data-dependent proposal distributions are often intractable. We show that by data aligning the softmax kernel, we obtain an attention mechanism which can both admit a tractable minimal-variance proposal distribution for importance sampling, and exhibits better training stability. Motivated by this finding, we introduce DARKFormer, a Data-Aware Random-feature Kernel transformer that features a data-aligned kernel geometry. DARKFormer learns the random-projection covariance, efficiently realizing an importance-sampled positive random-feature estimator for its data-aligned kernel. Empirically, DARKFormer narrows the performance gap with exact softmax attention, particularly in finetuning regimes where pretrained representations are anisotropic. By combining random-feature efficiency with data-aware kernels, DARKFormer advances kernel-based attention in resource-constrained settings.

Related papers

Scalable Random Wavelet Features: Efficient Non-Stationary Kernel Approximation with Convergence Guarantees [5.758073912084366]
We introduce Random Wavelet Features (RWF), a framework that constructs scalable, non-stationary kernel approximations by sampling from wavelet families.<n>By harnessing the inherent localization and multi-resolution structure of wavelets, RWF generates an explicit feature map that captures complex, input-dependent patterns.<n>We demonstrate empirically on a range of challenging synthetic and real-world datasets that RWF outperforms stationary random features and offers a compelling accuracy-efficiency trade-off.
arXiv Detail & Related papers (2026-02-01T02:56:56Z)
Scalable Gaussian Processes with Low-Rank Deep Kernel Decomposition [7.532273334759435]
Kernels are key to encoding prior beliefs and data structures in Gaussian process (GP) models.<n>Deep kernel learning enhances kernel flexibility by feeding inputs through a neural network before applying a standard parametric form.<n>We introduce a fully data-driven, scalable deep kernel representation where a neural network directly represents a low-rank kernel.
arXiv Detail & Related papers (2025-05-24T05:42:11Z)
Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction. We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue. In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z)
Score-based Diffusion Models in Function Space [137.70916238028306]
Diffusion models have recently emerged as a powerful framework for generative modeling.<n>This work introduces a mathematically rigorous framework called Denoising Diffusion Operators (DDOs) for training diffusion models in function space.<n>We show that the corresponding discretized algorithm generates accurate samples at a fixed cost independent of the data resolution.
arXiv Detail & Related papers (2023-02-14T23:50:53Z)
Simplex Random Features [53.97976744884616]
We present Simplex Random Features (SimRFs), a new random feature (RF) mechanism for unbiased approximation of the softmax and Gaussian kernels. We prove that SimRFs provide the smallest possible mean square error (MSE) on unbiased estimates of these kernels. We show consistent gains provided by SimRFs in settings including pointwise kernel estimation, nonparametric classification and scalable Transformers.
arXiv Detail & Related papers (2023-01-31T18:53:39Z)
Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding [51.48582649050054]
We propose a representation normalization method which aims at disentangling the correlations between features of encoded sentences. We also propose Kernel-Whitening, a Nystrom kernel approximation method to achieve more thorough debiasing on nonlinear spurious correlations. Experiments show that Kernel-Whitening significantly improves the performance of BERT on out-of-distribution datasets while maintaining in-distribution accuracy.
arXiv Detail & Related papers (2022-10-14T05:56:38Z)
Local Random Feature Approximations of the Gaussian Kernel [14.230653042112834]
We focus on the popular Gaussian kernel and on techniques to linearize kernel-based models by means of random feature approximations. We show that such approaches yield poor results when modelling high-frequency data, and we propose a novel localization scheme that improves kernel approximations and downstream performance significantly.
arXiv Detail & Related papers (2022-04-12T09:52:36Z)
Hybrid Random Features [60.116392415715275]
We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) HRFs automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest.
arXiv Detail & Related papers (2021-10-08T20:22:59Z)
A Note on Optimizing Distributions using Kernel Mean Embeddings [94.96262888797257]
Kernel mean embeddings represent probability measures by their infinite-dimensional mean embeddings in a reproducing kernel Hilbert space. We show that when the kernel is characteristic, distributions with a kernel sum-of-squares density are dense. We provide algorithms to optimize such distributions in the finite-sample setting.
arXiv Detail & Related papers (2021-06-18T08:33:45Z)
Decentralised Learning with Random Features and Distributed Gradient Descent [39.00450514924611]
We investigate the generalisation performance of Distributed Gradient Descent with Implicit Regularisation and Random Features in a homogenous setting. We establish high probability bounds on the predictive performance for each agent as a function of the step size, number of iterations, inverse spectral gap of the communication matrix and number of Random Features. We present simulations that show how the number of Random Features, iterations and samples impact predictive performance.
arXiv Detail & Related papers (2020-07-01T09:55:09Z)
Scaling up Kernel Ridge Regression via Locality Sensitive Hashing [6.704115928005158]
We introduce a weighted version of random binning features and show that the corresponding kernel function generates smooth Gaussian processes. We show that our weighted random binning features provide a spectral approximation to the corresponding kernel matrix, leading to efficient algorithms for kernel ridge regression.
arXiv Detail & Related papers (2020-03-21T21:41:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.