In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness
- URL: http://arxiv.org/abs/2402.11639v2
- Date: Tue, 28 May 2024 05:15:53 GMT
- Title: In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness
- Authors: Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, Sanjay Shakkottai,
- Abstract summary: We show the role of softmax attention in an ICL setting where each context encodes a regression task.
We show that an attention unit learns a window that it uses to implement a nearest-neighbors predictor adapted to the landscape of the pretraining tasks.
We also show that on low-rank, linear problems, the attention unit learns to project onto the appropriate subspace before inference.
- Score: 43.70647711168682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A striking property of transformers is their ability to perform in-context learning (ICL), a machine learning framework in which the learner is presented with a novel context during inference implicitly through some data, and tasked with making a prediction in that context. As such, that learner must adapt to the context without additional training. We explore the role of softmax attention in an ICL setting where each context encodes a regression task. We show that an attention unit learns a window that it uses to implement a nearest-neighbors predictor adapted to the landscape of the pretraining tasks. Specifically, we show that this window widens with decreasing Lipschitzness and increasing label noise in the pretraining tasks. We also show that on low-rank, linear problems, the attention unit learns to project onto the appropriate subspace before inference. Further, we show that this adaptivity relies crucially on the softmax activation and thus cannot be replicated by the linear activation often studied in prior theoretical analyses.
Related papers
- Provable In-Context Learning of Nonlinear Regression with Transformers [58.018629320233174]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters.<n>Recent research has actively explored the training dynamics behind ICL.<n>This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z) - Scaling Context Requires Rethinking Attention [5.923968936360167]
We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths.<n>We introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters.<n>Our experiments on the in-context learning of power attention shows that these models dominate both exponential attention and linear attention at long-context training.
arXiv Detail & Related papers (2025-07-06T04:15:34Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data.
Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Rethinking Associative Memory Mechanism in Induction Head [37.93644115914534]
This paper investigates how a two-layer transformer thoroughly captures in-context information and balances it with pretrained bigram knowledge in next token prediction.<n>We theoretically analyze the representation of weight matrices in attention layers and the resulting logits when a transformer is given prompts generated by a bigram model.
arXiv Detail & Related papers (2024-12-16T05:33:05Z) - On the Loss of Context-awareness in General Instruction Fine-tuning [101.03941308894191]
We investigate the loss of context awareness after supervised fine-tuning.
We find that the performance decline is associated with a bias toward different roles learned during conversational instruction fine-tuning.
We propose a metric to identify context-dependent examples from general instruction fine-tuning datasets.
arXiv Detail & Related papers (2024-11-05T00:16:01Z) - Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z) - CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning [17.614980614656407]
We propose Continual Generative training for Incremental prompt-Learning.
We exploit Variational Autoencoders to learn class-conditioned distributions.
We show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities.
arXiv Detail & Related papers (2024-07-22T16:51:28Z) - Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models [79.28821338925947]
Domain-Class Incremental Learning is a realistic but challenging continual learning scenario.
To handle these diverse tasks, pre-trained Vision-Language Models (VLMs) are introduced for their strong generalizability.
This incurs a new problem: the knowledge encoded in the pre-trained VLMs may be disturbed when adapting to new tasks, compromising their inherent zero-shot ability.
Existing methods tackle it by tuning VLMs with knowledge distillation on extra datasets, which demands heavy overhead.
We propose the Distribution-aware Interference-free Knowledge Integration (DIKI) framework, retaining pre-trained knowledge of
arXiv Detail & Related papers (2024-07-07T12:19:37Z) - Bigger is not Always Better: The Effect of Context Size on Speech
Pre-Training [8.130638226288402]
We investigate how much context is necessary to achieve high-quality pre-trained acoustic models using self-supervised learning.
We find that phone discriminability in the resulting model representations peaks at around 40ms of preceding context.
We find that this pattern also transfers to supervised ASR when the pre-trained representations are used as frozen input features.
arXiv Detail & Related papers (2023-12-03T22:08:54Z) - How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z) - On the Role of Attention in Prompt-tuning [90.97555030446563]
We study prompt-tuning for one-layer attention architectures and study contextual mixture-models.
We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention.
We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information.
arXiv Detail & Related papers (2023-06-06T06:23:38Z) - The Closeness of In-Context Learning and Weight Shifting for Softmax
Regression [42.95984289657388]
We study the in-context learning based on a softmax regression formulation.
We show that when training self-attention-only Transformers for fundamental regression tasks, the models learned by gradient-descent and Transformers show great similarity.
arXiv Detail & Related papers (2023-04-26T04:33:41Z) - Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z) - An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution on the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z) - SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided
Adaptive Memory [61.44510300515693]
We study the task of simultaneous lip and devise SimulLR, a simultaneous lip Reading transducer with attention-guided adaptive memory.
The experiments show that the SimulLR achieves the translation speedup 9.10 times times compared with the state-of-the-art non-simultaneous methods.
arXiv Detail & Related papers (2021-08-31T05:54:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.