Related papers: Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

URL: http://arxiv.org/abs/2509.21936v1
Date: Fri, 26 Sep 2025 06:21:30 GMT
Title: Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
Authors: O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová,
Abstract summary: We study the dominance of softmax over alternatives in large language models.<n>We show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short.<n>We discuss the connection with optimization by gradient-based algorithms.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the test error and show that, while softmax is no longer Bayes-optimal, it consistently outperforms linear attention. We discuss the connection with optimization by gradient-based algorithms.

Related papers

Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning [17.65459083031186]
We train dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression.<n>We show that a simple first-order gradient descent can converge to the globally optimal self-attention parameters.
arXiv Detail & Related papers (2026-03-02T06:44:54Z)
Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective [16.076157672455867]
We develop a measure-based framework for studying single-layer softmax attention under both finite and infinite prompts.<n>For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure.
arXiv Detail & Related papers (2025-12-12T18:54:52Z)
In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data.<n>Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality [54.20763128054692]
We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics.
arXiv Detail & Related papers (2024-02-29T18:43:52Z)
Revisiting Logistic-softmax Likelihood in Bayesian Meta-Learning for Few-Shot Classification [4.813254903898101]
logistic-softmax is often employed as an alternative to the softmax likelihood in multi-class Gaussian process classification. We revisit and redesign the logistic-softmax likelihood, which enables control of the textita priori confidence level through a temperature parameter. Our approach yields well-calibrated uncertainty estimates and achieves comparable or superior results on standard benchmark datasets.
arXiv Detail & Related papers (2023-10-16T13:20:13Z)
Learning to Optimize with Stochastic Dominance Constraints [103.26714928625582]
In this paper, we develop a simple yet efficient approach for the problem of comparing uncertain quantities. We recast inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses apparent intractability. The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.
arXiv Detail & Related papers (2022-11-14T21:54:31Z)
Efficient and Near-Optimal Smoothed Online Learning for Generalized Linear Functions [28.30744223973527]
We give a computationally efficient algorithm that is the first to enjoy the statistically optimal log(T/sigma) regret for realizable K-wise linear classification. We develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers.
arXiv Detail & Related papers (2022-05-25T21:31:36Z)
Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency [83.02999769628593]
We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning. We show that the minimax approach enables us to achieve a fast rate of convergence for weights and quality functions. We present the first finite-sample result with first-order efficiency in non-tabular environments.
arXiv Detail & Related papers (2021-02-05T03:20:39Z)
Fundamental Limits of Ridge-Regularized Empirical Risk Minimization in High Dimensions [41.7567932118769]
Empirical Risk Minimization algorithms are widely used in a variety of estimation and prediction tasks. In this paper, we characterize for the first time the fundamental limits on the statistical accuracy of convex ERM for inference.
arXiv Detail & Related papers (2020-06-16T04:27:38Z)
Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
empirical optimization is central to modern machine learning, but its role in its success is still unclear. We show that it commonly arises in parameters of discrete multiplicative noise due to variance. A detailed analysis is conducted in which we describe on key factors, including recent step size, and data, all exhibit similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.