Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions
- URL: http://arxiv.org/abs/2509.24914v1
- Date: Mon, 29 Sep 2025 15:19:31 GMT
- Title: Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions
- Authors: Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Florent Krzakala, Lenka Zdeborová,
- Abstract summary: We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks. We derive sharp asymptotics for training and test errors, locate interpolation and recovery thresholds, and characterize the limiting spectral distribution of the learned weights.
- Score: 26.597272916325537
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks, given by the recently introduced attention-indexed model. Using tools from random matrix theory, spin-glass physics, and approximate message passing, we derive sharp asymptotics for training and test errors, locate interpolation and recovery thresholds, and characterize the limiting spectral distribution of the learned weights. Weight decay induces an implicit nuclear-norm regularization, favoring low-rank query and key matrices. Leveraging this, we compare the standard factorized training of query and key matrices with a direct parameterization in which their product is trained element-wise, revealing the inductive bias introduced by the factorized form. Remarkably, the predicted spectral distribution echoes empirical trends reported in large-scale transformers, offering a theoretical perspective consistent with these phenomena.
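To make the two parameterizations concrete, here is a minimal numpy sketch (our illustration, not the authors' code; dimensions and scalings are arbitrary) of a tied single-head attention layer. With tied query/key weights W, the scores depend on W only through the product M = W Wᵀ, and weight decay on the factor W equals the nuclear norm of the PSD product M, which is the implicit low-rank bias described in the abstract; training M element-wise instead gives a plain Frobenius penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, L = 64, 4, 16                            # embedding dim, factor rank, sequence length
X = rng.standard_normal((L, d)) / np.sqrt(d)   # one synthetic sequence

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

# Factorized (tied) parameterization: query = key = W, so the attention
# scores depend on W only through the PSD product M = W W^T.
W = rng.standard_normal((d, r)) / np.sqrt(d)
M = W @ W.T
A = softmax_rows(X @ M @ X.T)                  # L x L attention matrix

# Weight decay on the factor is a nuclear-norm penalty on the product:
# ||W||_F^2 = tr(W W^T) = sum of eigenvalues of M = ||M||_* (M is PSD).
decay_on_factor = np.sum(W**2)
nuclear_norm_M = np.linalg.svd(M, compute_uv=False).sum()
assert np.isclose(decay_on_factor, nuclear_norm_M)

# Direct parameterization: train M element-wise; weight decay is then a
# plain Frobenius penalty ||M||_F^2, with no low-rank preference.
M_direct = rng.standard_normal((d, d)) / d
decay_on_product = np.sum(M_direct**2)
print(A.shape, decay_on_factor, nuclear_norm_M, decay_on_product)
```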
Related papers
- TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors [53.891337639229285]
We introduce TensorLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention tensor. Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding.
arXiv Detail & Related papers (2026-01-25T19:21:25Z) - Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix [13.866041299126207]
Self-attention layers have become fundamental building blocks of modern deep neural networks. We provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention.
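As a numerical companion (a sketch under assumed Gaussian inputs and random weights, not necessarily the paper's exact setting), the attention matrix and its singular value spectrum can be generated as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 512, 128                           # sequence length, embedding dimension
X = rng.standard_normal((n, d))           # Gaussian inputs
Q = rng.standard_normal((d, d)) / np.sqrt(d)
K = rng.standard_normal((d, d)) / np.sqrt(d)

S = (X @ Q) @ (X @ K).T / np.sqrt(d)      # pre-softmax score matrix
S -= S.max(axis=1, keepdims=True)         # numerically stable row softmax
A = np.exp(S)
A /= A.sum(axis=1, keepdims=True)         # n x n attention matrix

# Empirical singular value spectrum: the object whose limit the Gaussian
# equivalence result characterizes.
sv = np.linalg.svd(A, compute_uv=False)
print("top singular values:", sv[:5])
```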
arXiv Detail & Related papers (2025-10-08T06:13:42Z) - Models of Heavy-Tailed Mechanistic Universality [62.107333654304014]
We propose a family of random matrix models to explore attributes that give rise to heavy-tailed behavior in trained neural networks. Under this model, spectral densities with power-law tails arise through a combination of three independent factors. Implications of our model for other appearances of heavy tails, including neural scaling laws, trajectories, and the five-plus-one phases of neural network training, are discussed.
arXiv Detail & Related papers (2025-06-04T00:55:01Z) - High-Dimensional Kernel Methods under Covariate Shift: Data-Dependent Implicit Regularization [83.06112052443233]
This paper studies kernel ridge regression in high dimensions under covariate shifts.
By a bias-variance decomposition, we theoretically demonstrate that the re-weighting strategy reduces the variance.
For the bias, we analyze regularization at arbitrary or well-chosen scales, showing that the bias can behave very differently across regularization scales.
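A minimal sketch of the re-weighting strategy in kernel ridge regression (our own toy example; the RBF kernel, Gaussian source/target densities, and closed-form importance weights are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def f_star(X):                        # target function
    return np.sin(X).sum(axis=1)

n, d, lam = 200, 2, 1e-2
Xtr = rng.normal(0.0, 1.0, (n, d))    # source covariates
Xte = rng.normal(0.5, 1.0, (100, d))  # shifted target covariates
ytr = f_star(Xtr) + 0.1 * rng.standard_normal(n)

# Importance weights w(x) = p_target(x) / p_source(x); both densities are
# known unit-variance Gaussians here, so the ratio is closed-form.
log_w = ((Xtr - 0.0) ** 2 - (Xtr - 0.5) ** 2).sum(axis=1) / 2.0
w = np.exp(log_w)

# Re-weighted kernel ridge regression: alpha solves (W K + lam I) alpha = W y.
K = rbf(Xtr, Xtr)
Wm = np.diag(w)
alpha = np.linalg.solve(Wm @ K + lam * np.eye(n), Wm @ ytr)
pred = rbf(Xte, Xtr) @ alpha
print("target-domain MSE:", np.mean((pred - f_star(Xte)) ** 2))
```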
arXiv Detail & Related papers (2024-06-05T12:03:27Z) - Entrywise error bounds for low-rank approximations of kernel matrices [55.524284152242096]
We derive entrywise error bounds for low-rank approximations of kernel matrices obtained using the truncated eigen-decomposition.
A key technical innovation is a delocalisation result for the eigenvectors of the kernel matrix corresponding to small eigenvalues.
We validate our theory with an empirical study of a collection of synthetic and real-world datasets.
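For reference, the approximation studied here is easy to instantiate (the kernel choice and rank below are arbitrary): build a kernel matrix, truncate its eigendecomposition, and measure the entrywise error that the bounds control.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 300, 3, 20
X = rng.standard_normal((n, d))

# RBF kernel matrix
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)

# Rank-r approximation from the truncated eigendecomposition
evals, evecs = np.linalg.eigh(K)            # eigenvalues in ascending order
U, lam = evecs[:, -r:], evals[-r:]
K_r = (U * lam) @ U.T

# Entrywise (max-norm) error, the quantity the bounds control, versus the
# spectral-norm error, which equals the largest dropped eigenvalue.
print("entrywise error:", np.abs(K - K_r).max())
print("spectral error :", evals[-r - 1])
```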
arXiv Detail & Related papers (2024-05-23T12:26:25Z) - Connectivity Shapes Implicit Regularization in Matrix Factorization Models for Matrix Completion [2.8948274245812335]
We investigate the implicit regularization of matrix factorization for solving matrix completion problems. We empirically discover that the connectivity of the observed data plays a crucial role in the implicit bias. Our work reveals the intricate interplay between data connectivity, training dynamics, and implicit regularization in matrix factorization models.
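A toy instance of this setting (our own sketch; rank, mask density, and step size are arbitrary choices): the mask of observed entries defines a bipartite graph whose connectivity, per the paper, shapes the implicit bias of gradient descent on the factors.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # low-rank truth
mask = rng.random((n, n)) < 0.3   # observed entries; their bipartite graph's
                                  # connectivity is the quantity studied above

# Small-initialization gradient descent on the factorized model U V^T
U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((n, r))
lr = 0.02
for _ in range(5000):
    R = mask * (U @ V.T - M)      # residual on observed entries only
    U, V = U - lr * R @ V, V - lr * R.T @ U

print("max error, observed  :", np.abs(mask * (U @ V.T - M)).max())
print("max error, unobserved:", np.abs(~mask * (U @ V.T - M)).max())
```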
arXiv Detail & Related papers (2024-05-22T15:12:14Z) - Hermitian stochastic methodology for X-ray superfluorescence [0.0]
A recently introduced theoretical framework for modeling the dynamics of X-ray amplified spontaneous emission is based on stochastic sampling of the density matrix of quantum emitters and the radiation field.
While based on first principles and providing valuable theoretical insights, the original differential equations exhibit divergences and numerical instabilities.
Here, we resolve this issue by accounting for these components perturbatively.
arXiv Detail & Related papers (2024-02-06T15:16:27Z) - The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
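To fix notation (an illustrative snippet, not the paper's construction): for a deep linear network the end-to-end matrix is the ordered product of the layer weights, and the Schatten 1-norm in the equivalence is its nuclear norm.

```python
import numpy as np

rng = np.random.default_rng(5)
d, depth = 10, 3
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

# End-to-end matrix of the deep linear network f(x) = W_depth ... W_1 x
E = np.linalg.multi_dot(Ws[::-1])

# Schatten 1-norm (nuclear norm): the sum of the singular values of E
schatten_1 = np.linalg.svd(E, compute_uv=False).sum()
print("||E||_S1 =", schatten_1)
```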
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Heavy-Tailed Regularization of Weight Matrices in Deep Neural Networks [8.30897399932868]
A key finding indicates that the generalization performance of a neural network is associated with the degree of heavy tails in the spectrum of its weight matrices.
We introduce a novel regularization technique, termed Heavy-Tailed Regularization, which explicitly promotes a more heavy-tailed spectrum in the weight matrix through regularization.
We empirically show that heavy-tailed regularization outperforms conventional regularization techniques in terms of generalization performance.
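One common way to quantify the "degree of heavy tails" is a power-law fit to the weight matrix's empirical spectral distribution; below is a rough Hill-estimator sketch (our choice of estimator and dimensions, not the paper's regularizer).

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((1000, 500))      # stand-in for a trained weight matrix

# Empirical spectral distribution of the correlation matrix W^T W / n
evals = np.linalg.eigvalsh(W.T @ W / W.shape[0])

# Rough Hill estimate of the tail index from the k largest eigenvalues;
# a smaller index means a heavier tail, which the proposed regularizer
# would explicitly encourage during training.
k = 50
top = np.sort(evals)[-k:]                 # k largest eigenvalues
alpha = k / np.log(top / top[0]).sum()
print("estimated tail index:", alpha)
```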
arXiv Detail & Related papers (2023-04-06T07:50:14Z) - Fluctuations, Bias, Variance & Ensemble of Learners: Exact Asymptotics for Convex Losses in High-Dimension [25.711297863946193]
We develop a theory for the study of fluctuations in an ensemble of generalised linear models trained on different, but correlated, features.
We provide a complete description of the joint distribution of the empirical risk minimiser for generic convex loss and regularisation in the high-dimensional limit.
arXiv Detail & Related papers (2022-01-31T17:44:58Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)