Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality
- URL: http://arxiv.org/abs/2602.01863v1
- Date: Mon, 02 Feb 2026 09:34:17 GMT
- Title: Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality
- Authors: Ryotaro Kawata, Taiji Suzuki
- Abstract summary: Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens. We show that a shallow measure-theoretic Transformer learns the recall-and-predict map under a spectral assumption on the input densities.
- Score: 52.424255020469595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for mixture contexts $\nu = I^{-1} \sum_{i=1}^{I} \mu^{(i)}$ and a query $x_{\mathrm{q}}(i^*)$, the task decomposes into (i) recall of the relevant component $\mu^{(i^*)}$ and (ii) prediction from $(\mu^{(i^*)}, x_{\mathrm{q}})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving sharpness of the convergence order. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.
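To make the recall step concrete, here is a minimal numerical sketch (our illustration, not the paper's construction): tokens are sampled from a Gaussian-mixture context, and softmax attention with inner-product scores acts as a Monte Carlo approximation of the integral operator, concentrating its mass on the relevant component $\mu^{(i^*)}$. The Gaussian form, all sizes, and the inverse temperature $\beta$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture context nu = (1/I) sum_i mu^{(i)}: I Gaussian components, with the
# context observed as a long sample of tokens. All sizes are illustrative.
I, d, n_per = 4, 8, 200
centers = rng.normal(size=(I, d)) * 3.0       # well-separated component means
tokens = np.concatenate([c + 0.3 * rng.normal(size=(n_per, d)) for c in centers])
labels = np.repeat(np.arange(I), n_per)

# Query tied to component i* (here simply a noisy draw from mu^{(i*)}).
i_star = 2
x_q = centers[i_star] + 0.3 * rng.normal(size=d)

# Softmax attention as a Monte Carlo integral operator on the context measure:
#   (A nu)(x_q) = E_nu[y * e^{beta <x_q, y>}] / E_nu[e^{beta <x_q, y>}]
beta = 1.0                                    # inverse temperature (assumed)
scores = beta * (tokens @ x_q)
w = np.exp(scores - scores.max())
w /= w.sum()

# (i) Recall: attention mass concentrates on the relevant component mu^{(i*)}.
print("mass on component i*:", w[labels == i_star].sum())

# (ii) Predict: the attention output approximates the mean of mu^{(i*)},
# the statistic a downstream MLP head would map to the target.
recalled = w @ tokens
print("distance to center i*:", np.linalg.norm(recalled - centers[i_star]))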
Related papers
- Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,\lambda}$ Targets [8.844802588836059]
This paper is the first work proving that standard Transformers can approximate Hölder functions. By introducing two metrics, the size and the dimension vector, we provide a fine-grained characterization of Transformer structures.
arXiv Detail & Related papers (2026-02-24T05:14:01Z)
- Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning [61.07540493350384]
Self-distillation (SD) is the process of retraining a student on a mixture of the ground-truth labels and the teacher's own predictions. We show that for any prediction risk, the optimally mixed student improves upon the ridge teacher for every regularization level. We propose a consistent one-shot tuning method to estimate the optimal mixing weight without grid search, sample splitting, or refitting.
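A toy sketch of this SD loop with ridge; the mixing-weight symbol in the abstract is garbled, so the notation $\alpha$ here is ours, and all sizes and the regularization level are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes; lam and alpha are arbitrary choices.
n, p, lam = 200, 50, 1.0
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p) / np.sqrt(p)
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Teacher: plain ridge fit on the ground-truth labels.
beta_teacher = ridge(X, y, lam)

# Student: ridge refit on a mixture of labels and teacher predictions.
# "Unconstrained" SD lets alpha range outside [0, 1]; the paper tunes it
# optimally with a one-shot estimate (alpha fixed here for illustration).
alpha = 0.7
y_mix = alpha * y + (1 - alpha) * (X @ beta_teacher)
beta_student = ridge(X, y_mix, lam)

for name, b in [("teacher", beta_teacher), ("student", beta_student)]:
    print(name, "parameter error:", np.linalg.norm(b - beta_true))
```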
arXiv Detail & Related papers (2026-02-19T17:21:15Z)
- Singular Bayesian Neural Networks [1.2891210250935148]
Bayesian neural networks promise calibrated uncertainty but require $O(mn)$ parameters for standard mean-field Gaussian posteriors. We induce a posterior that is singular with respect to the Lebesgue measure, concentrating on the rank-$r$ manifold. We derive PAC-Bayes generalization bounds whose complexity term scales as $\sqrt{r(m+n)}$ instead of $\sqrt{mn}$, and prove loss bounds that decompose the error into optimization and rank-induced bias.
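A schematic sketch of one way such a singular posterior can arise, assuming a mean-field Gaussian over rank-$r$ factors $U, V$ with $W = UV^\top$ (our reading of the abstract, not necessarily the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(2)

# Mean-field Gaussian posterior over rank-r factors, so the variational
# parameter count is O(r(m+n)) rather than O(mn) for a dense mean-field
# posterior over W itself. (Schematic, not the paper's exact code.)
m, n, r = 256, 128, 4
mu_U, rho_U = np.zeros((m, r)), -3.0 * np.ones((m, r))  # means, pre-softplus scales
mu_V, rho_V = np.zeros((n, r)), -3.0 * np.ones((n, r))

def softplus(x):
    return np.log1p(np.exp(x))

def sample_weight():
    # Reparameterized draw; the induced law of W = U V^T lives on the
    # rank-<=r manifold, hence is singular w.r.t. Lebesgue measure on R^{m x n}.
    U = mu_U + softplus(rho_U) * rng.normal(size=(m, r))
    V = mu_V + softplus(rho_V) * rng.normal(size=(n, r))
    return U @ V.T

W = sample_weight()
print(W.shape, "rank:", np.linalg.matrix_rank(W))         # (256, 128) rank: 4
print("variational params:", 2 * r * (m + n), "vs dense:", 2 * m * n)
```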
arXiv Detail & Related papers (2026-01-30T23:06:34Z)
- Phase-space entropy at acquisition reflects downstream learnability [54.4100065023873]
We propose an acquisition-level scalar $S_{\mathcal{B}}$ based on instrument-resolved phase space. We show theoretically that $S_{\mathcal{B}}$ correctly identifies the phase-space coherence of periodic sampling. $|S_{\mathcal{B}}|$ consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty without training.
arXiv Detail & Related papers (2025-12-22T10:03:51Z)
- Approximation Bounds for Transformer Networks with Application to Regression [9.549045683389085]
We explore the approximation capabilities of Transformer networks for Hölder and Sobolev functions. We establish novel upper bounds for standard Transformer networks approximating sequence-to-sequence mappings. We show that if the self-attention layer in a Transformer can perform column averaging, the network can approximate sequence-to-sequence Hölder functions.
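The column-averaging condition is easy to verify in a toy sketch: with zero query-key scores, softmax attention weights are uniform and the layer outputs the token average (tokens stored as rows here; sizes are illustrative):

```python
import numpy as np

# With zero query-key scores, softmax attention weights are uniform, so the
# layer outputs the average token: the "column averaging" the result needs.
rng = np.random.default_rng(5)
X = rng.normal(size=(16, 4))                 # 16 tokens (rows) in R^4
scores = np.zeros((16, 16))                  # e.g. X W_Q (X W_K)^T with W_Q = 0
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = attn @ X
assert np.allclose(out, X.mean(axis=0))      # every row is the token average
```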
arXiv Detail & Related papers (2025-04-16T15:25:58Z)
- Exact Sequence Interpolation with Transformers [0.0]
We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d \geq 2$, with corresponding output sequences of smaller or equal length. Specifically, given $N$ sequences of arbitrary but finite lengths in $\mathbb{R}^d$ and output sequences of lengths $m_1, \dots, m_N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^{N} m_j)$ blocks and …
arXiv Detail & Related papers (2025-02-04T12:31:00Z)
- Precise Asymptotics of Bagging Regularized M-estimators [20.077783679095443]
We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators. Key to our analysis is a new result on the joint behavior of correlations between the estimator and residual errors on overlapping subsamples. Joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data.
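A minimal subagging sketch with ridge standing in for the regularized M-estimator; subsample size $k$, ensemble size $M$, and $\lambda$ are the three knobs the paper tunes jointly, and all values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Subagging: average M ridge estimators, each fit on a size-k subsample
# drawn without replacement. All sizes are illustrative.
n, p, k, M, lam = 500, 100, 200, 25, 0.5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p) / np.sqrt(p)
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

estimates = []
for _ in range(M):
    idx = rng.choice(n, size=k, replace=False)   # overlapping subsamples
    estimates.append(ridge(X[idx], y[idx], lam))
beta_bag = np.mean(estimates, axis=0)

# The paper's message: tune (k, M, lam) jointly against validation risk,
# not lam alone on the full data.
print("bagged parameter error:", np.linalg.norm(beta_bag - beta_true))
```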
arXiv Detail & Related papers (2024-09-23T17:48:28Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
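A toy asynchronous Q-learning loop on a randomly generated MDP, showing the entrywise updates along a single Markovian trajectory that the bound counts; sizes, step size, and the uniform behavior policy are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Asynchronous Q-learning: along one Markovian trajectory, only the
# (state, action) entry actually visited is updated at each step.
S, A, gamma, lr, T = 10, 3, 0.9, 0.1, 50_000
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a] in Delta(S)
R = rng.uniform(size=(S, A))                 # rewards in [0, 1]

Q = np.zeros((S, A))
s = 0
for _ in range(T):
    a = rng.integers(A)                      # fixed uniform behavior policy
    s_next = rng.choice(S, p=P[s, a])
    target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])       # entrywise asynchronous update
    s = s_next

# The quoted bound: an entrywise eps-accurate estimate of Q* needs on the
# order of 1/(mu_min (1-gamma)^5 eps^2) + t_mix/(mu_min (1-gamma)) samples,
# with mu_min the minimum stationary (state, action) occupancy probability.
```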
arXiv Detail & Related papers (2020-06-04T17:51:00Z)