Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent
- URL: http://arxiv.org/abs/2510.10425v1
- Date: Sun, 12 Oct 2025 03:20:27 GMT
- Title: Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent
- Authors: Sara Dragutinović, Andrew M. Saxe, Aaditya K. Singh
- Abstract summary: We focus on understanding the learning algorithm transformers use to learn from context. We find that transformers still learn to do gradient descent in-context, though on functionals in the kernel feature space. These theoretical findings suggest a greater adaptability to context for softmax attention.
- Score: 17.629377639287775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable ability of transformers to learn new concepts solely by reading examples within the input prompt, termed in-context learning (ICL), is a crucial aspect of intelligent behavior. Here, we focus on understanding the learning algorithm transformers use to learn from context. Existing theoretical work, often based on simplifying assumptions, has primarily focused on linear self-attention and continuous regression tasks, finding transformers can learn in-context by gradient descent. Given that transformers are typically trained on discrete and complex tasks, we bridge the gap from this existing work to the setting of classification, with non-linear (importantly, softmax) activation. We find that transformers still learn to do gradient descent in-context, though on functionals in the kernel feature space and with a context-adaptive learning rate in the case of softmax transformer. These theoretical findings suggest a greater adaptability to context for softmax attention, which we empirically verify and study through ablations. Overall, we hope this enhances theoretical understanding of in-context learning algorithms in more realistic settings, pushes forward our intuitions and enables further theory bridging to larger models.
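As a rough illustration of the correspondence the abstract describes, the following is a minimal sketch (not the paper's construction; the toy task, the exponential kernel, and the per-context step size are assumptions made for the example) comparing a softmax-attention readout over labeled context pairs with one step of kernel gradient descent whose learning rate adapts to the context:

```python
import numpy as np

# Toy illustration: softmax attention over labeled context pairs (x_i, y_i)
# computes a kernel-weighted vote, which can be read as one functional
# gradient-descent step in an exponential-kernel feature space, with the
# softmax normalizer acting as a context-adaptive learning rate.

rng = np.random.default_rng(0)
d, n = 8, 32                         # input dimension, number of context examples
X = rng.standard_normal((n, d))      # context inputs
w_true = rng.standard_normal(d)
y = np.sign(X @ w_true)              # binary context labels in {-1, +1}
x_q = rng.standard_normal(d)         # query input

def softmax_attention_prediction(X, y, x_q, beta=1.0):
    """Softmax attention with keys/queries = inputs and values = labels."""
    logits = beta * (X @ x_q)                # attention scores
    a = np.exp(logits - logits.max())
    a /= a.sum()                             # softmax normalization
    return a @ y                             # kernel-weighted label average

def kernel_gd_prediction(X, y, x_q, beta=1.0):
    """One functional-gradient step from f = 0 under squared loss with the
    kernel k(x, x') = exp(beta * x.x'); the step size eta is chosen
    per-context so the two predictions coincide."""
    k = np.exp(beta * (X @ x_q))             # kernel evaluations k(x_i, x_q)
    eta = 1.0 / k.sum()                      # context-adaptive learning rate
    return eta * (k @ y)                     # f_1(x_q) = eta * sum_i k_i * y_i

print(softmax_attention_prediction(X, y, x_q))
print(kernel_gd_prediction(X, y, x_q))       # matches up to floating point
```

With this choice of step size the two predictions coincide exactly: the softmax denominator plays the role of the context-adaptive learning rate.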
Related papers
- On the Robustness of Transformers against Context Hijacking for Linear Classification [26.1838836907147]
Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. They can be disrupted by factually correct context, a phenomenon known as context hijacking. We show that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations.
arXiv Detail & Related papers (2025-02-21T17:31:00Z) - One-Layer Transformer Provably Learns One-Nearest Neighbor In Context [48.4979348643494]
We study the capability of one-layer transformers learning the one-nearest neighbor rule.
A single softmax attention layer can successfully learn to behave like a one-nearest-neighbor classifier.
arXiv Detail & Related papers (2024-11-16T16:12:42Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of the regression loss despite the non-convexity of the loss landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context [44.949726166566236]
We show that (non-linear) Transformers naturally learn to implement gradient descent in function space.
We also show that the optimal choice of non-linear activation depends in a natural way on the class of functions that need to be learned.
arXiv Detail & Related papers (2023-12-11T17:05:25Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - A Closer Look at In-Context Learning under Distribution Shifts [24.59271215602147]
We aim to better understand the generality and limitations of in-context learning from the lens of the simple yet fundamental task of linear regression.
We find that both transformers and set-based MLPs exhibit in-context learning under in-distribution evaluations, but transformers more closely emulate the performance of ordinary least squares (OLS).
Transformers also display better resilience to mild distribution shifts, where set-based MLPs falter.
arXiv Detail & Related papers (2023-05-26T07:47:21Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - What learning algorithm is in-context learning? Investigations with linear models [87.91612418166464]
We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly.
We show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression.
We present preliminary evidence that in-context learners share algorithmic features with these predictors; a sketch of these reference predictors appears after this list.
arXiv Detail & Related papers (2022-11-28T18:59:51Z)
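As a companion to the last entry, here is a minimal sketch under assumed hyperparameters (the trained transformer itself is not reproduced; `eta` and `lam` are illustrative) that computes the reference predictors named above, one-step gradient descent, ridge regression, and exact least squares, on a toy in-context regression task:

```python
import numpy as np

# Reference predictors that in-context predictions are compared against in
# the entry above. This is only the baselines on a toy linear-regression
# context; no transformer is trained here.

rng = np.random.default_rng(1)
d, n = 5, 20
X = rng.standard_normal((n, d))                  # context inputs
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)    # noisy context targets
x_q = rng.standard_normal(d)                     # query input

# One step of gradient descent on the squared loss, starting from w = 0.
eta = 0.01
w_gd = eta * X.T @ y                             # gradient step from the origin

# Ridge regression: w = (X^T X + lam * I)^{-1} X^T y.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Exact least squares (minimum-norm solution via the pseudoinverse).
w_ols = np.linalg.pinv(X) @ y

for name, w in [("1-step GD", w_gd), ("ridge", w_ridge), ("OLS", w_ols)]:
    print(f"{name:10s} prediction at query: {w @ x_q:+.3f}")
```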
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.