Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods
- URL: http://arxiv.org/abs/2506.10959v1
- Date: Thu, 12 Jun 2025 17:56:26 GMT
- Title: Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods
- Authors: Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao
- Abstract summary: In-context learning (ICL) has achieved remarkable success in natural language and vision domains. In this work, we initiate a theoretical study of ICL for regression of Hölder functions on manifolds. Our findings provide foundational insights into the role of geometry in ICL and novel tools for studying ICL of nonlinear models.
- Score: 48.038668788625465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding, particularly in the context of structured geometric data, remains unexplored. In this work, we initiate a theoretical study of ICL for regression of Hölder functions on manifolds. By establishing a novel connection between the attention mechanism and classical kernel methods, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers attain the minimax regression rate for Hölder functions on manifolds, which scales exponentially with the intrinsic dimension of the manifold rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools for studying ICL of nonlinear models.
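To make the attention-to-kernel connection above concrete, here is a minimal sketch, not the paper's actual construction: a single softmax-attention head that reads (x_i, y_i) prompt pairs and predicts at a query point behaves like a Nadaraya-Watson kernel smoother over the prompt. The function names, the temperature and bandwidth values, and the toy circle task below are illustrative assumptions.

```python
import numpy as np

def softmax_attention_regression(X_prompt, y_prompt, x_query, tau=0.1):
    """One softmax-attention head reading (x_i, y_i) pairs from the prompt.

    Taking queries/keys to be the inputs themselves, the head's output at
    x_query is a softmax-weighted average of the prompt labels, i.e. a
    Nadaraya-Watson-style kernel regression estimate.
    """
    scores = X_prompt @ x_query / tau          # similarity to each prompt input
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over prompt positions
    return weights @ y_prompt                  # kernel-smoothed prediction

def nadaraya_watson(X_prompt, y_prompt, x_query, h=0.3):
    """Classical kernel regression with a Gaussian kernel, for comparison."""
    d2 = np.sum((X_prompt - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * h ** 2))
    return (w @ y_prompt) / w.sum()

# Toy prompt: noisy samples of a smooth function on the unit circle,
# a one-dimensional manifold embedded in R^2.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=64)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # points on S^1
y = np.sin(3 * theta) + 0.1 * rng.standard_normal(64)  # smooth target + noise

x_star = np.array([np.cos(0.7), np.sin(0.7)])
print(softmax_attention_regression(X, y, x_star))  # attention prediction
print(nadaraya_watson(X, y, x_star))               # kernel-regression prediction
print(np.sin(3 * 0.7))                             # ground truth at the query
```

Under this reading, the prompt length plays the role of the sample size in classical nonparametric regression, and the effective dimension entering the rate is that of the manifold the inputs live on rather than the ambient space, consistent with the bounds described in the abstract.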
Related papers
- Provable In-Context Learning of Nonlinear Regression with Transformers [58.018629320233174]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z) - Learning Beyond Euclid: Curvature-Adaptive Generalization for Neural Networks on Manifolds [0.0]
Existing generalization theories often rely on complexity measures derived from Euclidean geometry, which fail to account for the intrinsic structure of non-Euclidean spaces. We derive generalization bounds that explicitly incorporate manifold-specific properties such as sectional curvature, volume growth, and injectivity radius. This framework provides a principled understanding of how intrinsic geometry affects learning capacity, offering both theoretical insight and practical implications for deep learning on structured data domains.
arXiv Detail & Related papers (2025-07-01T23:16:49Z) - Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning [48.67380502157004]
Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood.
arXiv Detail & Related papers (2025-05-16T08:50:42Z) - Manifold Learning with Normalizing Flows: Towards Regularity, Expressivity and Iso-Riemannian Geometry [8.020732438595905]
This work focuses on addressing distortions and modeling errors that can arise in the multi-modal setting. We showcase the effectiveness of the synergy of the proposed approaches in several numerical experiments with both synthetic and real data.
arXiv Detail & Related papers (2025-05-12T21:44:42Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond [61.18736646013446]
In pursuit of a deeper understanding of its surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network.
Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z) - Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient [0.49478969093606673]
We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory.
We study the development of internal structure in transformer language models during training.
arXiv Detail & Related papers (2024-10-03T20:51:02Z) - Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z) - Asymptotic theory of in-context learning by linear attention [33.53106537972063]
In-context learning is a cornerstone of Transformers' success. Questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. (A hedged linear-attention sketch follows this list.)
arXiv Detail & Related papers (2024-05-20T03:24:24Z) - On Computational Modeling of Sleep-Wake Cycle [5.234742752529437]
Neuroscience treats sleep and wake as default and perturbation modes of the brain.
It is hypothesized that the brain self-organizes neural activities without environmental inputs.
This paper presents a new computational model of the sleep-wake cycle for learning and memory.
arXiv Detail & Related papers (2024-04-08T13:06:23Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
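As a companion to the linear-attention entries above (see the note in the "Asymptotic theory of in-context learning by linear attention" summary), here is a hedged sketch of how a single linear-attention head can turn a prompt of (x_i, y_i) pairs into a least-squares-style prediction. The function name, the choice of the matrix W, and the toy task are illustrative assumptions rather than any paper's exact setup.

```python
import numpy as np

def linear_attention_icl(X_prompt, y_prompt, x_query, W=None):
    """A single linear-attention head acting on in-context (x_i, y_i) pairs.

    Without a softmax, the prediction is bilinear in the prompt:
    f(x) = x^T W (1/n) sum_i y_i x_i, i.e. one preconditioned
    least-squares step starting from a zero weight vector.
    """
    n, d = X_prompt.shape
    if W is None:
        W = np.eye(d)                      # illustrative default
    moment = X_prompt.T @ y_prompt / n     # (1/n) * sum_i y_i x_i
    return x_query @ W @ moment

# Toy in-context linear regression task: y_i = <w*, x_i> + noise.
rng = np.random.default_rng(1)
d, n = 5, 200
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_star + 0.05 * rng.standard_normal(n)
x_star = rng.standard_normal(d)

# With W set to the inverse empirical input covariance, the head reproduces
# the ordinary least-squares prediction on this prompt.
W_hat = np.linalg.inv(X.T @ X / n)
print(linear_attention_icl(X, y, x_star, W=W_hat))  # attention prediction
print(x_star @ w_star)                              # noiseless target value
```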
This list is automatically generated from the titles and abstracts of the papers in this site.