Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective
- URL: http://arxiv.org/abs/2512.11784v1
- Date: Fri, 12 Dec 2025 18:54:52 GMT
- Title: Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective
- Authors: Etienne Boursier, Claire Boyer,
- Abstract summary: We develop a measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure.
- Score: 16.076157672455867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.
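Below is a minimal numerical sketch, not taken from the paper's code, of the concentration phenomenon the abstract describes: for i.i.d. Gaussian tokens, the output of a single-layer softmax attention head at a fixed query stabilises as the prompt length grows, approaching an infinite-prompt reference that depends only on the token distribution (approximated here by one very long prompt). All names and dimensions (d, W, V, q, the prompt lengths) are illustrative assumptions rather than the paper's notation.

```python
# Minimal sketch (assumed setup, not the paper's code): concentration of
# single-layer softmax attention toward its large-prompt limit for i.i.d.
# Gaussian tokens.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                      # token dimension (illustrative)
W = rng.normal(size=(d, d)) / np.sqrt(d)   # fixed query-key matrix
V = rng.normal(size=(d, d)) / np.sqrt(d)   # fixed value matrix
q = rng.normal(size=d)                     # a fixed query token

def softmax_attention(q, X, W, V):
    """Softmax attention of query q over prompt tokens X (N x d)."""
    scores = X @ W.T @ q / np.sqrt(d)      # attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over the prompt
    return weights @ (X @ V.T)             # attention-weighted values

# Monte Carlo proxy for the infinite-prompt output: one very long prompt.
X_big = rng.normal(size=(1_000_000, d))
ref = softmax_attention(q, X_big, W, V)

# Average deviation from the reference shrinks as the prompt length grows,
# mirroring the non-asymptotic concentration described in the abstract.
for N in [32, 128, 512, 2048, 8192]:
    errs = [np.linalg.norm(softmax_attention(q, rng.normal(size=(N, d)), W, V) - ref)
            for _ in range(50)]
    print(f"N = {N:5d}   mean ||finite - limit|| = {np.mean(errs):.4f}")
```

Running this prints an average deviation that decreases as the prompt length N grows, which is the qualitative behaviour the paper's non-asymptotic bounds quantify; the code only illustrates the concentration, not the claimed linearity of the limiting operator on the token measure.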
Related papers
- Softmax Linear Attention: Reclaiming Global Competition [28.81301173774774]
We propose Softmax Linear Attention (SLA), a framework designed to restore competitive selection without sacrificing efficiency. Experiments demonstrate that SLA consistently enhances state-of-the-art linear baselines across language modeling and long-context benchmarks.
arXiv Detail & Related papers (2026-02-02T07:25:03Z) - SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention [50.99430451151184]
Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning. We propose SoLA-Vision, a flexible layer-wise hybrid attention backbone.
arXiv Detail & Related papers (2026-01-16T10:26:53Z) - Statistical Advantage of Softmax Attention: Insights from Single-Location Regression [0.0]
We study the dominance of softmax over alternatives in large language models. We show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We discuss the connection with optimization by gradient-based algorithms.
arXiv Detail & Related papers (2025-09-26T06:21:30Z) - On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective [3.1044138971639743]
The main drawback of softmax attention is its quadratic memory requirement and computational complexity with respect to the sequence length. By replacing the softmax nonlinearity, linear attention and similar methods have been introduced to avoid this quadratic bottleneck. This work demonstrates that linear attention is an approximation of softmax attention by deriving the recurrent form of softmax attention.
arXiv Detail & Related papers (2025-07-31T15:10:03Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention. First, we prove that linear attention is not injective and is thus prone to assigning identical attention weights to different query vectors. Second, we confirm that effective local modeling is essential for the success of softmax attention, an area in which linear attention falls short.
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - Directed Exploration in Reinforcement Learning from Linear Temporal Logic [59.707408697394534]
Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning. We show that the synthesized reward signal remains fundamentally sparse, making exploration challenging. We show how better exploration can be achieved by further leveraging the specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process.
arXiv Detail & Related papers (2024-08-18T14:25:44Z) - Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality [54.20763128054692]
We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression.
We prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics.
arXiv Detail & Related papers (2024-02-29T18:43:52Z) - Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
A convergence rate analysis of the mean field Langevin dynamics is presented.
The distribution $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.