A Random Matrix Analysis of In-context Memorization for Nonlinear Attention
- URL: http://arxiv.org/abs/2506.18656v1
- Date: Mon, 23 Jun 2025 13:56:43 GMT
- Title: A Random Matrix Analysis of In-context Memorization for Nonlinear Attention
- Authors: Zhenyu Liao, Jiaqing Liu, TianQi Hou, Difan Zou, Zenan Ling,
- Abstract summary: We show that nonlinear Attention incurs higher memorization error than linear ridge regression on random inputs. Our results reveal how nonlinearity and input structure interact to govern the memorization performance of nonlinear Attention.
- Score: 18.90197287760915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention mechanisms have revolutionized machine learning (ML) by enabling efficient modeling of global dependencies across inputs. Their inherently parallelizable structures allow for efficient scaling with the exponentially increasing size of both pretraining data and model parameters. Yet, despite their central role as the computational backbone of modern large language models (LLMs), the theoretical understanding of Attention, especially in the nonlinear setting, remains limited. In this paper, we provide a precise characterization of the \emph{in-context memorization error} of \emph{nonlinear Attention}, in the high-dimensional proportional regime where the number of input tokens $n$ and their embedding dimension $p$ are both large and comparable. Leveraging recent advances in the theory of large kernel random matrices, we show that nonlinear Attention typically incurs higher memorization error than linear ridge regression on random inputs. However, this gap vanishes, and can even be reversed, when the input exhibits statistical structure, particularly when the Attention weights align with the input signal direction. Our results reveal how nonlinearity and input structure interact to govern the memorization performance of nonlinear Attention. The theoretical insights are supported by numerical experiments.
Related papers
- Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models [51.85815025140659]
Modern Machine Learning (ML) and Deep Neural Networks (DNNs) often operate on high-dimensional data. In particular, the proportional regime where the data dimension, sample size, and number of model parameters are all large gives rise to novel and sometimes counterintuitive behaviors. This paper extends traditional Random Matrix Theory (RMT) beyond eigenvalue-based analysis of linear models to address the challenges posed by nonlinear ML models.
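A quick illustration of the proportional-regime phenomena RMT describes (our own toy dimensions): the eigenvalues of a sample covariance matrix with p/n fixed do not concentrate around 1, but spread over the Marchenko-Pastur bulk [(1-sqrt(c))^2, (1+sqrt(c))^2].

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 300, 600                  # proportional regime with ratio c = p/n = 1/2
c = p / n

X = rng.standard_normal((p, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)   # spectrum of the sample covariance

# Marchenko-Pastur predicts the bulk edges of this spectrum
edge_lo, edge_hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
print(f"empirical support: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"Marchenko-Pastur:  [{edge_lo:.3f}, {edge_hi:.3f}]")
```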
arXiv Detail & Related papers (2025-06-16T06:54:08Z)
- Log-Linear Attention [81.09631871212211]
This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants.
arXiv Detail & Related papers (2025-06-05T08:44:51Z)
- Tailored minimal reservoir computing: on the bidirectional connection between nonlinearities in the reservoir and in data [0.0]
We study how the degree of nonlinearity in the input data affects the optimal design of reservoir computers. By reducing minimal RCs to a single tunable nonlinearity parameter, we explore how the predictive performance varies with the degree of nonlinearity in the reservoir. Our experiments reveal that the prediction performance is maximized when the reservoir's nonlinearity matches the nonlinearity present in the data.
arXiv Detail & Related papers (2025-04-24T12:47:21Z)
- Nonlinear Multiple Response Regression and Learning of Latent Spaces [2.6113259186042876]
We introduce a unified method capable of learning latent spaces in both unsupervised and supervised settings. Unlike other neural network methods that operate as "black boxes", our approach not only offers better interpretability but also reduces computational complexity.
arXiv Detail & Related papers (2025-03-27T15:28:06Z)
- Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
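The RKHS view above admits a small concrete check. In the sketch below (our own toy parameterization, scalar output), a single linear-attention head w_v^T (X X^T / n) A q is exactly a linear predictor in the feature map vec(X X^T / n) ⊗ q, so it can be fit by ordinary least squares over (prompt, output) pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 5, 8                           # embedding dimension, prompt length

A = rng.standard_normal((p, p))       # combined key-query matrix W_K^T W_Q
w_v = rng.standard_normal(p)          # value/output projection (scalar head)

def linear_attention(X, q):
    """Single-head linear attention, scalar output: w_v^T (X X^T / n) A q."""
    S = X @ X.T / X.shape[1]
    return float(w_v @ S @ A @ q)

def phi(X, q):
    """Feature map in which the head is a linear predictor: vec(S) kron q."""
    S = X @ X.T / X.shape[1]
    return np.kron(S.reshape(-1), q)

# the tensorized parameter theta = w_v kron vec(A) reproduces the head exactly
theta = np.kron(w_v, A.reshape(-1))
X0, q0 = rng.standard_normal((p, n)), rng.standard_normal(p)
exact = abs(linear_attention(X0, q0) - theta @ phi(X0, q0))

# ... hence the head is learnable by plain least squares on sampled prompts
prompts = [(rng.standard_normal((p, n)), rng.standard_normal(p)) for _ in range(400)]
Phi = np.stack([phi(X, q) for X, q in prompts])
targets = np.array([linear_attention(X, q) for X, q in prompts])
theta_hat, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

Xt, qt = rng.standard_normal((p, n)), rng.standard_normal(p)
held_out = abs(linear_attention(Xt, qt) - theta_hat @ phi(Xt, qt))
print(f"identity check: {exact:.2e}, held-out error: {held_out:.2e}")
```

Note that Phi is rank-deficient (S is symmetric), so least squares returns a minimum-norm parameter that differs from theta yet predicts exactly on new prompts, since their features lie in the same subspace.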
arXiv Detail & Related papers (2024-10-14T02:41:01Z)
- Metric-Entropy Limits on the Approximation of Nonlinear Dynamical Systems [4.069144210024563]
We show that RNNs can approximate nonlinear systems that satisfy a Lipschitz property and forget past inputs fast enough. As the sets of sequence-to-sequence mappings we consider are significantly more massive than function classes generally analyzed in approximation theory, a refined metric-entropy characterization is needed.
arXiv Detail & Related papers (2024-07-01T12:57:03Z)
- Intrinsic Voltage Offsets in Memcapacitive Bio-Membranes Enable High-Performance Physical Reservoir Computing [0.0]
Reservoir computing is a brain-inspired machine learning framework for processing temporal data by mapping inputs into high-dimensional spaces.
Here, we introduce a novel memcapacitor-based physical reservoir computer (PRC) that exploits internal voltage offsets to enable both monotonic and non-monotonic input-state correlations.
Our approach and its unprecedented performance mark a major milestone towards high-performance, fully in-materia PRCs.
arXiv Detail & Related papers (2024-04-27T05:47:38Z)
- Learning Linear Causal Representations from Interventions under General Nonlinear Mixing [52.66151568785088]
We prove strong identifiability results given unknown single-node interventions without access to the intervention targets.
This is the first instance of causal identifiability from non-paired interventions for deep neural network embeddings.
arXiv Detail & Related papers (2023-06-04T02:32:12Z)
- General stochastic separation theorems with optimal bounds [68.8204255655161]
The phenomenon of stochastic separability was revealed and used in machine learning to correct errors of Artificial Intelligence (AI) systems and analyze AI instabilities.
Errors or clusters of errors can be separated from the rest of the data.
The ability to correct an AI system also opens up the possibility of an attack on it, and the high dimensionality induces vulnerabilities caused by the same separability.
arXiv Detail & Related papers (2020-10-11T13:12:41Z)
- Sparse Quantized Spectral Clustering [85.77233010209368]
We exploit tools from random matrix theory to make precise statements about how the eigenspectrum of a matrix changes under such nonlinear transformations.
We show that very little change occurs in the informative eigenstructure even under drastic sparsification/quantization.
arXiv Detail & Related papers (2020-10-03T15:58:07Z)
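The last result can be sketched in a few lines under assumptions of our own (a rank-one spiked Gram model with 1-bit sign quantization, not the paper's exact setting): the top eigenvector of the kernel matrix still recovers the hidden class labels after drastic quantization.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 300
mu = rng.standard_normal(p)
mu /= np.linalg.norm(mu)                          # hidden class-mean direction
labels = rng.choice([-1.0, 1.0], size=n)          # two hidden classes

X = 3.0 * np.outer(mu, labels) + rng.standard_normal((p, n))  # spiked data
K = X.T @ X / p                                   # Gram (kernel) matrix
K_q = np.sign(K)                                  # drastic 1-bit quantization

def label_overlap(M, labels):
    """|cosine| between the top eigenvector of M and the label vector."""
    _, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]
    return abs(v @ labels) / np.linalg.norm(labels)

print(f"overlap, full kernel:     {label_overlap(K, labels):.3f}")
print(f"overlap, 1-bit quantized: {label_overlap(K_q, labels):.3f}")
```

Both overlaps stay well above chance: the informative spike eigenvalue survives quantization far outside the noise bulk, which is the random-matrix mechanism the paper makes precise.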
This list is automatically generated from the titles and abstracts of the papers in this site.