Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix
- URL: http://arxiv.org/abs/2510.06685v1
- Date: Wed, 08 Oct 2025 06:13:42 GMT
- Title: Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix
- Authors: Tomohiro Hayase, Benoît Collins, Ryo Karakida
- Abstract summary: Self-attention layers have become fundamental building blocks of modern deep neural networks. We provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention.
- Score: 13.866041299126207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko-Pastur law, contrary to what previous work had assumed. Our proof relies on two key ingredients: precise control of fluctuations in the normalization term and a refined linearization that leverages favorable Taylor expansions of the exponential. This analysis also identifies a threshold for linearization and elucidates why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.
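As a quick empirical companion to the abstract, the sketch below builds a softmax attention matrix from i.i.d. Gaussian inputs and inspects its squared singular values. The dimensions, the rank-one centering, and the sqrt(n*d) rescaling are our illustrative assumptions, not the paper's construction.

```python
# Empirical squared-singular-value spectrum of a softmax attention matrix.
# Under this rescaling, an i.i.d.-entry matrix would follow the
# Marchenko-Pastur law on [0, 4]; the paper shows attention deviates from it.
import numpy as np

rng = np.random.default_rng(0)
n, d, beta = 512, 512, 1.0                     # constant-order inverse temperature

X = rng.standard_normal((n, d)) / np.sqrt(d)   # token embeddings
Wq = rng.standard_normal((d, d)) / np.sqrt(d)  # query weights
Wk = rng.standard_normal((d, d)) / np.sqrt(d)  # key weights

S = beta * (X @ Wq) @ (X @ Wk).T               # pre-softmax scores
A = np.exp(S - S.max(axis=1, keepdims=True))   # row-wise softmax
A /= A.sum(axis=1, keepdims=True)

# Remove the near-uniform 1/n mean component and rescale the fluctuations.
sv2 = np.linalg.svd(np.sqrt(n * d) * (A - 1.0 / n), compute_uv=False) ** 2
print("empirical support of squared singular values:", sv2.min(), sv2.max())
```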
Related papers
- Preconditioning Benefits of Spectral Orthogonalization in Muon [50.62925024212989]
We study the effectiveness of a simplified variant of Muon in two case studies: matrix factorization and in-context learning of linear transformers. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior.
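The simplified Muon variant is not spelled out here, but the core spectral orthogonalization step is well known: replace the matrix gradient with its polar factor, so every singular value of the applied update equals one. A minimal numpy sketch, using a plain SVD rather than Muon's Newton-Schulz iteration:

```python
import numpy as np

def orthogonalize(G: np.ndarray) -> np.ndarray:
    """Return the closest semi-orthogonal matrix to G (its polar factor)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))      # a weight matrix
G = rng.standard_normal((64, 32))      # stand-in for a gradient/momentum
W -= 0.1 * orthogonalize(G)            # one spectrally preconditioned step

# The update acts uniformly across spectral directions: all singular values are 1.
print(np.linalg.svd(orthogonalize(G), compute_uv=False)[:3])
```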
arXiv Detail & Related papers (2026-01-20T00:08:31Z) - Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions [26.597272916325537]
We study the empirical risk of a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks. We derive sharp characterizations of training and test errors, locate recovery thresholds, and characterize the limiting spectral distribution of the learned weights.
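For concreteness, here is a minimal sketch of a tied single-head attention layer of the kind studied, where queries and keys share one weight matrix; the Gaussian data and all shapes are illustrative stand-ins for their synthetic tasks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 128
X = rng.standard_normal((n, d))                 # one input sequence
W = rng.standard_normal((d, d)) / np.sqrt(d)    # tied query/key weights

S = X @ W @ X.T / np.sqrt(d)                    # tied attention scores
A = np.exp(S - S.max(axis=1, keepdims=True))    # row-wise softmax
A /= A.sum(axis=1, keepdims=True)
out = A @ X                                     # attention output

# The paper characterizes the limiting spectrum of the *learned* W; for this
# random W we can still inspect its empirical singular values:
print(np.linalg.svd(W, compute_uv=False)[:5])
```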
arXiv Detail & Related papers (2025-09-29T15:19:31Z) - Critical behavior of the Schwinger model via gauge-invariant VUMPS [0.0]
We study the lattice Schwinger model by combining the variational uniform matrix product state (VUMPS) algorithm with a gauge-invariant matrix product ansatz. We analyze the scaling in the simultaneous critical and continuum limits and confirm that the data collapse aligns with the Ising universality class to remarkable precision.
arXiv Detail & Related papers (2024-12-04T18:59:18Z) - High-Dimensional Kernel Methods under Covariate Shift: Data-Dependent Implicit Regularization [83.06112052443233]
This paper studies kernel ridge regression in high dimensions under covariate shifts.
By a bias-variance decomposition, we theoretically demonstrate that the re-weighting strategy allows for decreasing the variance.
For the bias, we analyze regularization at an arbitrary or well-chosen scale, showing that the bias can behave very differently under different regularization scales.
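A minimal sketch of the re-weighting strategy being analyzed, assuming importance-weighted kernel ridge regression with a known density ratio; the Gaussian kernel, the ratio, and all scales below are illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 3, 1e-2
X = rng.standard_normal((n, d))                    # training covariates
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

w = np.exp(0.5 * X[:, 0])                          # stand-in density ratio p_test/p_train
W = np.diag(w)

sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * d))                          # Gaussian kernel matrix

# Weighted KRR: minimize sum_i w_i (y_i - f(x_i))^2 + lam ||f||_H^2,
# whose solution is alpha = (W K + lam I)^{-1} W y.
alpha = np.linalg.solve(W @ K + lam * np.eye(n), W @ y)
f_train = K @ alpha
print("weighted train MSE:", np.mean(w * (y - f_train) ** 2))
```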
arXiv Detail & Related papers (2024-06-05T12:03:27Z) - Statistical physics of principal minors: Cavity approach [0.0]
We compute the sum of powers of principal minors of a matrix.
This is relevant to the study of critical behaviors in quantum fermionic systems.
We show that no (finite-temperature) phase transition is observed in this class of diagonally dominant matrices.
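The object itself is easy to state; the sketch below computes it by brute force for a small diagonally dominant matrix (the cavity method is what makes it tractable at scale), with the p = 1 identity det(I + A) as a sanity check.

```python
import itertools
import numpy as np

def sum_principal_minor_powers(A: np.ndarray, p: float) -> float:
    """Sum of p-th powers of all principal minors of A (empty minor = 1)."""
    n = A.shape[0]
    total = 1.0
    for k in range(1, n + 1):
        for idx in itertools.combinations(range(n), k):
            sub = A[np.ix_(idx, idx)]
            total += np.linalg.det(sub) ** p
    return total

rng = np.random.default_rng(0)
A = rng.random((8, 8))
A += np.diag(A.sum(axis=1))   # make A strictly diagonally dominant

# Sanity check: for p = 1 the sum over all principal minors equals det(I + A).
assert np.isclose(sum_principal_minor_powers(A, 1.0), np.linalg.det(np.eye(8) + A))
print(sum_principal_minor_powers(A, p=2.0))
```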
arXiv Detail & Related papers (2024-05-30T10:09:49Z) - Entrywise error bounds for low-rank approximations of kernel matrices [55.524284152242096]
We derive entrywise error bounds for low-rank approximations of kernel matrices obtained using the truncated eigen-decomposition.
A key technical innovation is a delocalisation result for the eigenvectors of the kernel matrix corresponding to small eigenvalues.
We validate our theory with an empirical study of a collection of synthetic and real-world datasets.
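An empirical check in the same spirit: form a kernel matrix, truncate its eigendecomposition to rank r, and measure the entrywise (max-norm) error that such bounds control. The RBF kernel and its bandwidth are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                     # RBF kernel matrix

vals, vecs = np.linalg.eigh(K)            # eigenvalues in ascending order
for r in (5, 20, 50):
    V = vecs[:, -r:]                      # top-r eigenvectors
    K_r = V @ np.diag(vals[-r:]) @ V.T    # truncated eigendecomposition
    print(r, np.abs(K - K_r).max())       # entrywise (max-norm) error
```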
arXiv Detail & Related papers (2024-05-23T12:26:25Z) - Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction [55.57072563835959]
We propose an eigenvalue correction strategy that can free filters from the constraints of repeated eigenvalue inputs. Concretely, the proposed eigenvalue correction strategy enhances the uniform distribution of eigenvalues, and improves the fitting capacity and expressive power of filters.
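A hedged sketch of what such an eigenvalue correction step could look like: blend the (possibly repeated) normalized-Laplacian eigenvalues with an equispaced grid, so the filter sees distinct, more uniformly distributed inputs. The blending rule and the weight gamma are our assumptions for illustration, not the paper's exact formula.

```python
import numpy as np

def corrected_spectrum(L: np.ndarray, gamma: float = 0.5):
    vals, vecs = np.linalg.eigh(L)                     # vals may repeat
    uniform = np.linspace(vals.min(), vals.max(), len(vals))
    return (1 - gamma) * vals + gamma * uniform, vecs

# Toy graph: two disjoint triangles give repeated Laplacian eigenvalues.
A = np.kron(np.eye(2), np.ones((3, 3)) - np.eye(3))
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(6) - D_inv_sqrt @ A @ D_inv_sqrt           # normalized Laplacian

vals, _ = np.linalg.eigh(L)
new_vals, _ = corrected_spectrum(L)
print("before:", np.round(vals, 3))       # repeated eigenvalues
print("after: ", np.round(new_vals, 3))   # distinct, more uniform eigenvalues
```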
arXiv Detail & Related papers (2024-01-28T08:12:00Z) - Quantum tomography of helicity states for general scattering processes [55.2480439325792]
Quantum tomography has become an indispensable tool in order to compute the density matrix $\rho$ of quantum systems in physics.
We present the theoretical framework for reconstructing the helicity quantum initial state of a general scattering process.
arXiv Detail & Related papers (2023-10-16T21:23:42Z) - The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for any depth greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
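The quantity in the equivalence is easy to instantiate numerically: in a deep linear network the end-to-end map is the product of the layer matrices, and its Schatten 1-norm is the nuclear norm, i.e. the sum of singular values. A small sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 16)) / 4 for _ in range(3)]  # depth > 1

# End-to-end matrix of the deep linear network: product of layer matrices.
end_to_end = layers[0]
for W in layers[1:]:
    end_to_end = W @ end_to_end

# Schatten 1-norm (nuclear norm) = sum of singular values.
schatten_1 = np.linalg.svd(end_to_end, compute_uv=False).sum()
print("Schatten 1-norm of end-to-end matrix:", schatten_1)
```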
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Heavy-Tailed Regularization of Weight Matrices in Deep Neural Networks [8.30897399932868]
A key finding is that the generalization performance of a neural network is associated with the degree of heavy tails in the spectrum of its weight matrices.
We introduce a novel regularization technique, termed Heavy-Tailed Regularization, which explicitly promotes a more heavy-tailed spectrum in the weight matrix through regularization.
We empirically show that heavy-tailed regularization outperforms conventional regularization techniques in terms of generalization performance.
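One plausible way to turn this into a penalty, sketched below under our own assumptions (not the paper's exact regularizer): estimate the tail exponent of a weight matrix's singular-value spectrum with a Hill estimator and penalize large exponents, so that heavier-tailed spectra (smaller exponents) are preferred.

```python
import numpy as np

def hill_tail_exponent(W: np.ndarray, k: int = 20) -> float:
    """Hill estimator of the tail exponent of W's singular values,
    using the k largest values relative to the (k+1)-th."""
    s = np.sort(np.linalg.svd(W, compute_uv=False))[::-1]  # descending
    top = s[: k + 1]
    return 1.0 / np.mean(np.log(top[:-1] / top[k]))

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)) / 16.0
penalty = hill_tail_exponent(W)   # add lam * penalty to the training loss
print("estimated tail exponent:", penalty)
```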
arXiv Detail & Related papers (2023-04-06T07:50:14Z) - Spectral Regularization: an Inductive Bias for Sequence Modeling [7.365884062005811]
This paper presents a spectral regularization technique, which attaches a unique inductive bias to sequence modeling.
From fundamental connections between Hankel matrices and regular grammars, we propose to use the trace norm of the Hankel matrix, the tightest convex relaxation of its rank, as the spectral regularizer.
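A minimal sketch of the regularizer: build a Hankel matrix from a sequence function's values over prefix/suffix pairs and take its trace (nuclear) norm. The toy scalar function f below stands in for a trained sequence model.

```python
import itertools
import numpy as np

def hankel_trace_norm(f, length: int = 3) -> float:
    """Trace (nuclear) norm of the Hankel matrix H[p, s] = f(p + s)."""
    prefixes = list(itertools.product([0, 1], repeat=length))
    suffixes = prefixes
    H = np.array([[f(p + s) for s in suffixes] for p in prefixes])
    return np.linalg.svd(H, compute_uv=False).sum()

f = lambda seq: 0.9 ** sum(seq)   # toy "model": value assigned to a sequence
print("Hankel trace-norm penalty:", hankel_trace_norm(f))
```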
arXiv Detail & Related papers (2022-11-04T04:07:05Z) - Relative Error Bound Analysis for Nuclear Norm Regularized Matrix Completion [101.83262280224729]
We develop a relative error bound for nuclear norm regularized matrix completion.
We derive a relative upper bound for recovering the best low-rank approximation of the unknown matrix.
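For reference, the standard workhorse behind nuclear norm regularized completion is singular value thresholding, the proximal operator of the nuclear norm. A minimal proximal-gradient sketch (not the paper's analysis), with illustrative rank, size, and observation rate:

```python
import numpy as np

def svt(Z: np.ndarray, tau: float) -> np.ndarray:
    """Prox of tau * ||.||_*: soft-threshold the singular values of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 60))  # rank-5 target
mask = rng.random(M.shape) < 0.4                                 # observed entries

X = np.zeros_like(M)
for _ in range(200):                  # proximal gradient iterations
    grad = (X - M) * mask             # gradient of 0.5 * ||P_obs(X - M)||^2
    X = svt(X - grad, tau=0.5)

print("relative recovery error:", np.linalg.norm(X - M) / np.linalg.norm(M))
```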
arXiv Detail & Related papers (2015-04-26T13:12:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.