When can in-context learning generalize out of task distribution?
- URL: http://arxiv.org/abs/2506.05574v1
- Date: Thu, 05 Jun 2025 20:30:50 GMT
- Title: When can in-context learning generalize out of task distribution?
- Authors: Chase Goddard, Lindsay M. Smith, Vudtiwat Ngampruetikorn, David J. Schwab
- Abstract summary: In-context learning (ICL) is a capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize \emph{out-of-distribution}. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space.
- Score: 10.962094053749095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize \emph{out-of-distribution}. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.
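The pretraining setup the abstract describes, in which transformers are trained on prompts of example pairs from linear functions and task diversity is varied, can be sketched minimally as follows. This is an illustrative reconstruction of the standard in-context linear-regression setup, not the authors' code; all names and parameters (`sample_prompt`, `n_tasks`, pool-based diversity) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prompt(task_pool, n_points=16, d=8):
    """Draw one ICL prompt: context pairs (x_i, w.x_i) for a task w
    sampled from a finite pool of weight vectors."""
    w = task_pool[rng.integers(len(task_pool))]
    X = rng.normal(size=(n_points, d))
    y = X @ w
    return X, y, w

# Task diversity here is controlled by the pool size: a small pool can
# favor a specialized, in-distribution solution, while a large pool
# pushes the model toward a general in-context solver.
d, n_tasks = 8, 64
task_pool = rng.normal(size=(n_tasks, d))
X, y, w = sample_prompt(task_pool, n_points=16, d=d)
```

In the paper's framing, sweeping a diversity knob like `n_tasks` against other factors (model depth, input dimension) is what traces out the phase diagram between the specialized and generalizing solutions.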
Related papers
- When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers [64.1656365676171]
Task arithmetic refers to editing a pretrained model by adding a weighted sum of task vectors. This paper theoretically proves the effectiveness of task addition for simultaneously learning a set of irrelevant or aligned tasks. It also proves how to properly select the arithmetic coefficient to achieve negation on out-of-domain tasks.
arXiv Detail & Related papers (2025-04-15T08:04:39Z) - Differential learning kinetics govern the transition from memorization to generalization during in-context learning [0.5555497750998242]
Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks. We show that the sub-circuits that memorize and generalize can be viewed as largely independent.
arXiv Detail & Related papers (2024-11-27T22:12:29Z) - In-Context Learning of Linear Systems: Generalization Theory and Applications to Operator Learning [10.333724466273233]
We study theoretical guarantees for solving linear systems in-context using a linear transformer architecture. For in-domain generalization, we provide neural scaling laws that bound the generalization error in terms of the number of tasks and the sizes of samples used in training and inference. For out-of-domain generalization, we find that the behavior of trained transformers under task distribution shifts depends crucially on the distribution of the tasks seen during training.
arXiv Detail & Related papers (2024-09-18T19:59:50Z) - In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z) - Does learning the right latent variables necessarily improve in-context learning? [13.828665019247444]
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights.
In this paper, we investigate the effect of explicitly inferring task latents.
We find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance.
arXiv Detail & Related papers (2024-05-29T15:06:10Z) - Asymptotic theory of in-context learning by linear attention [33.53106537972063]
In-context learning is a cornerstone of Transformers' success. Questions about the sample complexity, pretraining task diversity, and context length necessary for successful ICL remain unresolved.
arXiv Detail & Related papers (2024-05-20T03:24:24Z) - How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z) - Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression [31.950737940558984]
Pretrained transformers exhibit the remarkable ability of in-context learning (ICL)
Can ICL solve fundamentally \emph{new} tasks that are very different from those seen during pretraining?
arXiv Detail & Related papers (2023-06-26T21:05:20Z) - Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z) - Leveraging sparse and shared feature activations for disentangled representation learning [112.22699167017471]
We propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation.
We validate our approach on six real world distribution shift benchmarks, and different data modalities.
arXiv Detail & Related papers (2023-04-17T01:33:24Z) - Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performances are sub-optimal or even lag far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z) - Precise High-Dimensional Asymptotics for Quantifying Heterogeneous Transfers [66.66228496844191]
The problem of learning one task with samples from another task is central to transfer learning (TL). In this paper, we examine a fundamental question: when does combining data samples from a source task and a target task perform better than single-task learning with the target task alone?
arXiv Detail & Related papers (2020-10-22T14:14:20Z)
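Several of the papers above analyze ICL for linear regression with linear attention, where a known result from that literature is that a trained single-layer linear attention head can implement one step of gradient descent on the in-context least-squares loss. A minimal illustrative sketch of that correspondence follows; the function name and parameters are hypothetical, and this is the gradient-descent view, not any one paper's exact parameterization.

```python
import numpy as np

def linear_attention_predict(X, y, x_query, eta=1.0):
    """Predict the query label via one gradient-descent step on the
    in-context least-squares loss, starting from w = 0:
        w_1 = eta * X^T y,   prediction = w_1 . x_query
    which is the computation a single linear attention head can
    express up to reparameterization."""
    w1 = eta * (X.T @ y)
    return float(w1 @ x_query)

# Toy check: with orthonormal context inputs and eta = 1, one step
# recovers the task vector exactly, so the prediction is exact.
X = np.eye(3)
w_true = np.array([1.0, 2.0, 3.0])
y = X @ w_true
pred = linear_attention_predict(X, y, np.array([0.0, 1.0, 0.0]))
```

With generic (non-orthonormal) contexts the single step is only an approximation to the least-squares solution, which is one reason depth and task structure matter in the analyses above.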
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.