Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality
- URL: http://arxiv.org/abs/2602.01863v1
- Date: Mon, 02 Feb 2026 09:34:17 GMT
- Title: Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality
- Authors: Ryotaro Kawata, Taiji Suzuki
- Abstract summary: Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens. We show that a shallow measure-theoretic Transformer learns the recall-and-predict map under a spectral assumption on the input densities.
- Score: 52.424255020469595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for mixture contexts $\nu = I^{-1} \sum_{i=1}^{I} \mu^{(i)}$ and a query $x_{\mathrm{q}}(i^*)$, the task decomposes into (i) recall of the relevant component $\mu^{(i^*)}$ and (ii) prediction from $(\mu^{(i^*)}, x_{\mathrm{q}})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving sharpness of the convergence order. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.
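To make the recall step concrete, here is a minimal numerical sketch (our illustration, not the paper's construction): tokens are sampled from a Gaussian-mixture context, and softmax attention with inner-product scores acts as a Monte Carlo approximation of the integral operator, concentrating its mass on the relevant component $\mu^{(i^*)}$. The Gaussian form, all sizes, and the inverse temperature $\beta$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture context nu = (1/I) sum_i mu^{(i)}: I Gaussian components, with the
# context observed as a long sample of tokens. All sizes are illustrative.
I, d, n_per = 4, 8, 200
centers = rng.normal(size=(I, d)) * 3.0       # well-separated component means
tokens = np.concatenate([c + 0.3 * rng.normal(size=(n_per, d)) for c in centers])
labels = np.repeat(np.arange(I), n_per)

# Query tied to component i* (here simply a noisy draw from mu^{(i*)}).
i_star = 2
x_q = centers[i_star] + 0.3 * rng.normal(size=d)

# Softmax attention as a Monte Carlo integral operator on the context measure:
#   (A nu)(x_q) = E_nu[y * e^{beta <x_q, y>}] / E_nu[e^{beta <x_q, y>}]
beta = 1.0                                    # inverse temperature (assumed)
scores = beta * (tokens @ x_q)
w = np.exp(scores - scores.max())
w /= w.sum()

# (i) Recall: attention mass concentrates on the relevant component mu^{(i*)}.
print("mass on component i*:", w[labels == i_star].sum())

# (ii) Predict: the attention output approximates the mean of mu^{(i*)},
# the statistic a downstream MLP head would map to the target.
recalled = w @ tokens
print("distance to center i*:", np.linalg.norm(recalled - centers[i_star]))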
Related papers
- Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,\lambda}$ Targets [8.844802588836059]
This paper is the first work proving that standard Transformers can approximate Hölder functions. By introducing two metrics, the size and the dimension vector, we provide a fine-grained characterization of Transformer structures.
arXiv Detail & Related papers (2026-02-24T05:14:01Z)
- Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning [61.07540493350384]
Self-distillation (SD) is the process of retraining a student on a mixture of the ground-truth labels and the teacher's own predictions. We show that for any prediction risk, the optimally mixed student improves upon the ridge teacher for every regularization level. We propose a consistent one-shot tuning method to estimate the optimal mixing weight without grid search, sample splitting, or refitting.
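A toy sketch of this SD loop with ridge; the mixing-weight symbol in the abstract is garbled, so the notation $\alpha$ here is ours, and all sizes and the regularization level are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes; lam and alpha are arbitrary choices.
n, p, lam = 200, 50, 1.0
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p) / np.sqrt(p)
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Teacher: plain ridge fit on the ground-truth labels.
beta_teacher = ridge(X, y, lam)

# Student: ridge refit on a mixture of labels and teacher predictions.
# "Unconstrained" SD lets alpha range outside [0, 1]; the paper tunes it
# optimally with a one-shot estimate (alpha fixed here for illustration).
alpha = 0.7
y_mix = alpha * y + (1 - alpha) * (X @ beta_teacher)
beta_student = ridge(X, y_mix, lam)

for name, b in [("teacher", beta_teacher), ("student", beta_student)]:
    print(name, "parameter error:", np.linalg.norm(b - beta_true))
```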
arXiv Detail & Related papers (2026-02-19T17:21:15Z)
- Singular Bayesian Neural Networks [1.2891210250935148]
Bayesian neural networks promise calibrated uncertainty but require $O(mn)$ parameters for standard mean-field Gaussian posteriors. We induce a posterior that is singular with respect to the Lebesgue measure, concentrating on the rank-$r$ manifold. We derive PAC-Bayes generalization bounds whose complexity term scales as $\sqrt{r(m+n)}$ instead of $\sqrt{mn}$, and prove loss bounds that decompose the error into optimization and rank-induced bias.
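A schematic sketch of one way such a singular posterior can arise, assuming a mean-field Gaussian over rank-$r$ factors $U, V$ with $W = UV^\top$ (our reading of the abstract, not necessarily the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(2)

# Mean-field Gaussian posterior over rank-r factors, so the variational
# parameter count is O(r(m+n)) rather than O(mn) for a dense mean-field
# posterior over W itself. (Schematic, not the paper's exact code.)
m, n, r = 256, 128, 4
mu_U, rho_U = np.zeros((m, r)), -3.0 * np.ones((m, r))  # means, pre-softplus scales
mu_V, rho_V = np.zeros((n, r)), -3.0 * np.ones((n, r))

def softplus(x):
    return np.log1p(np.exp(x))

def sample_weight():
    # Reparameterized draw; the induced law of W = U V^T lives on the
    # rank-<=r manifold, hence is singular w.r.t. Lebesgue measure on R^{m x n}.
    U = mu_U + softplus(rho_U) * rng.normal(size=(m, r))
    V = mu_V + softplus(rho_V) * rng.normal(size=(n, r))
    return U @ V.T

W = sample_weight()
print(W.shape, "rank:", np.linalg.matrix_rank(W))         # (256, 128) rank: 4
print("variational params:", 2 * r * (m + n), "vs dense:", 2 * m * n)
```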
arXiv Detail & Related papers (2026-01-30T23:06:34Z)
- Phase-space entropy at acquisition reflects downstream learnability [54.4100065023873]
We propose an acquisition-level scalar $S_{\mathcal{B}}$ based on instrument-resolved phase space. We show theoretically that $S_{\mathcal{B}}$ correctly identifies the phase-space coherence of periodic sampling. $|S_{\mathcal{B}}|$ consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty without training.
arXiv Detail & Related papers (2025-12-22T10:03:51Z)
- Approximation Bounds for Transformer Networks with Application to Regression [9.549045683389085]
We explore the approximation capabilities of Transformer networks for Hölder and Sobolev functions. We establish novel upper bounds for standard Transformer networks approximating sequence-to-sequence mappings. We show that if the self-attention layer in a Transformer can perform column averaging, the network can approximate sequence-to-sequence Hölder functions.
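The column-averaging condition is easy to verify in a toy sketch: with zero query-key scores, softmax attention weights are uniform and the layer outputs the token average (tokens stored as rows here; sizes are illustrative):

```python
import numpy as np

# With zero query-key scores, softmax attention weights are uniform, so the
# layer outputs the average token: the "column averaging" the result needs.
rng = np.random.default_rng(5)
X = rng.normal(size=(16, 4))                 # 16 tokens (rows) in R^4
scores = np.zeros((16, 16))                  # e.g. X W_Q (X W_K)^T with W_Q = 0
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = attn @ X
assert np.allclose(out, X.mean(axis=0))      # every row is the token average
```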
arXiv Detail & Related papers (2025-04-16T15:25:58Z)
- Exact Sequence Interpolation with Transformers [0.0]
We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d \geq 2$, with corresponding output sequences of smaller or equal length. Specifically, given $N$ sequences of arbitrary but finite lengths in $\mathbb{R}^d$ and output sequences of lengths $m_1, \dots, m_N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^{N} m_j)$ blocks and …
arXiv Detail & Related papers (2025-02-04T12:31:00Z)
- Precise Asymptotics of Bagging Regularized M-estimators [20.077783679095443]
We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators. Key to our analysis is a new result on the joint behavior of correlations between the estimator and residual errors on overlapping subsamples. Joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data.
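A minimal subagging sketch with ridge standing in for the regularized M-estimator; subsample size $k$, ensemble size $M$, and $\lambda$ are the three knobs the paper tunes jointly, and all values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Subagging: average M ridge estimators, each fit on a size-k subsample
# drawn without replacement. All sizes are illustrative.
n, p, k, M, lam = 500, 100, 200, 25, 0.5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p) / np.sqrt(p)
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

estimates = []
for _ in range(M):
    idx = rng.choice(n, size=k, replace=False)   # overlapping subsamples
    estimates.append(ridge(X[idx], y[idx], lam))
beta_bag = np.mean(estimates, axis=0)

# The paper's message: tune (k, M, lam) jointly against validation risk,
# not lam alone on the full data.
print("bagged parameter error:", np.linalg.norm(beta_bag - beta_true))
```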
arXiv Detail & Related papers (2024-09-23T17:48:28Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
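A toy asynchronous Q-learning loop on a randomly generated MDP, showing the entrywise updates along a single Markovian trajectory that the bound counts; sizes, step size, and the uniform behavior policy are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Asynchronous Q-learning: along one Markovian trajectory, only the
# (state, action) entry actually visited is updated at each step.
S, A, gamma, lr, T = 10, 3, 0.9, 0.1, 50_000
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a] in Delta(S)
R = rng.uniform(size=(S, A))                 # rewards in [0, 1]

Q = np.zeros((S, A))
s = 0
for _ in range(T):
    a = rng.integers(A)                      # fixed uniform behavior policy
    s_next = rng.choice(S, p=P[s, a])
    target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])       # entrywise asynchronous update
    s = s_next

# The quoted bound: an entrywise eps-accurate estimate of Q* needs on the
# order of 1/(mu_min (1-gamma)^5 eps^2) + t_mix/(mu_min (1-gamma)) samples,
# with mu_min the minimum stationary (state, action) occupancy probability.
```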
arXiv Detail & Related papers (2020-06-04T17:51:00Z)