Pretraining task diversity and the emergence of non-Bayesian in-context
learning for regression
- URL: http://arxiv.org/abs/2306.15063v2
- Date: Wed, 8 Nov 2023 18:12:03 GMT
- Title: Pretraining task diversity and the emergence of non-Bayesian in-context
learning for regression
- Authors: Allan Raventós, Mansheej Paul, Feng Chen, Surya Ganguli
- Abstract summary: Pretrained transformers exhibit the remarkable ability of in-context learning (ICL).
Can ICL solve fundamentally $\textit{new}$ tasks that are very different from those seen during pretraining?
- Score: 31.950737940558984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained transformers exhibit the remarkable ability of in-context learning
(ICL): they can learn tasks from just a few examples provided in the prompt
without updating any weights. This raises a foundational question: can ICL
solve fundamentally $\textit{new}$ tasks that are very different from those
seen during pretraining? To probe this question, we examine ICL's performance
on linear regression while varying the diversity of tasks in the pretraining
dataset. We empirically demonstrate a $\textit{task diversity threshold}$ for
the emergence of ICL. Below this threshold, the pretrained transformer cannot
solve unseen regression tasks, instead behaving like a Bayesian estimator with
the $\textit{non-diverse pretraining task distribution}$ as the prior. Beyond
this threshold, the transformer significantly outperforms this estimator; its
behavior aligns with that of ridge regression, corresponding to a Gaussian
prior over $\textit{all tasks}$, including those not seen during pretraining.
Thus, when pretrained on data with task diversity greater than the threshold,
transformers $\textit{can}$ optimally solve fundamentally new tasks in-context.
Importantly, this capability hinges on the model deviating from the Bayes optimal
estimator with the pretraining distribution as the prior. This study also
explores the effect of regularization, model capacity and task structure and
underscores, in a concrete example, the critical role of task diversity,
alongside data and model scale, in the emergence of ICL. Code is available at
https://github.com/mansheej/icl-task-diversity.
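As a rough illustration of the setup described in the abstract (not the paper's implementation; see the linked repository for that), the sketch below builds in-context linear regression prompts whose task vectors come from a small finite pretraining pool, and evaluates the two reference estimators the abstract compares the transformer against: the Bayesian posterior mean under the finite pretraining prior (the discrete MMSE estimator) and ridge regression, which is Bayes optimal under a Gaussian prior over all tasks. The dimensions, noise level, pool size, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, noise_var = 8, 16, 0.25   # illustrative dimension, context length, noise

def sample_task_pool(num_tasks):
    """A finite pool of pretraining tasks w ~ N(0, I_d) (low task diversity)."""
    return rng.normal(size=(num_tasks, d))

def sample_prompt(w):
    """In-context examples (X, y) for one linear regression task y = Xw + noise."""
    X = rng.normal(size=(n_ctx, d))
    y = X @ w + np.sqrt(noise_var) * rng.normal(size=n_ctx)
    return X, y

def dmmse_weights(X, y, task_pool):
    """Posterior-mean weights under a uniform prior over the finite task pool,
    i.e. the Bayesian estimator with the (non-diverse) pretraining task
    distribution as the prior."""
    resid = y[None, :] - task_pool @ X.T                   # (M, n_ctx) residuals
    log_lik = -0.5 * np.sum(resid**2, axis=1) / noise_var  # Gaussian log-likelihoods
    post = np.exp(log_lik - log_lik.max())
    post /= post.sum()
    return post @ task_pool                                # posterior-weighted tasks

def ridge_weights(X, y):
    """Ridge regression: Bayes optimal under a N(0, I_d) prior over *all* tasks."""
    lam = noise_var                                        # optimal ridge for unit prior variance
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Evaluate both estimators on a task that is NOT in the pretraining pool.
pool = sample_task_pool(num_tasks=4)
w_new = rng.normal(size=d)
X, y = sample_prompt(w_new)
x_query = rng.normal(size=d)

for name, w_hat in [("dMMSE", dmmse_weights(X, y, pool)),
                    ("ridge", ridge_weights(X, y))]:
    sq_err = (x_query @ (w_hat - w_new))**2
    print(f"{name:5s} squared prediction error on the unseen task: {sq_err:.3f}")
```

With a small pool, the dMMSE estimator snaps onto the pretraining tasks and fails on the held-out w_new, while ridge regression tracks it; the paper's finding is that a pretrained transformer transitions from the first behavior to the second as the number of pretraining tasks crosses a threshold.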
Related papers
- Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z) - Does learning the right latent variables necessarily improve in-context learning? [13.828665019247444]
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights.
In this paper, we investigate the effect of explicitly inferring task latents.
We find little discernible difference between models with and without explicit latent inference; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance.
arXiv Detail & Related papers (2024-05-29T15:06:10Z) - Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification [7.869708570399577]
We consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$.
Theoretically, we show that the trained Transformer reaches near Bayes-optimality, suggesting that it makes use of information about the training distribution.
arXiv Detail & Related papers (2024-05-24T00:08:55Z) - How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression (see the sketch after this list).
arXiv Detail & Related papers (2023-10-12T15:01:43Z) - Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks [69.38572074372392]
We present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks.
Our key insight is that multi-task pretraining induces a pseudo-contrastive loss favoring representations that align points which typically share the same label across tasks.
arXiv Detail & Related papers (2023-07-13T16:39:08Z) - Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z) - Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-trained models have achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model so that it achieves uniformly good performance across downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z) - Blessing of Class Diversity in Pre-training [54.335530406959435]
We prove that when the classes of the pre-training task are sufficiently diverse, pre-training can significantly improve the sample efficiency of downstream tasks.
Our proof relies on a vector-form Rademacher complexity chain rule for composite function classes and a modified self-concordance condition.
arXiv Detail & Related papers (2022-09-07T20:10:12Z) - Task-Customized Self-Supervised Pre-training with Scalable Dynamic
Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degrade the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z)
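The setup referenced above in "How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?" (a linearly parameterized single-layer linear attention model pretrained on linear regression) can be sketched roughly as follows. This is a loose reconstruction from the abstract rather than that paper's code: a common simplification in this line of work collapses the attention parameters into a single $d \times d$ matrix $\Gamma$, so the prediction for a query $x_q$ given context $(X, y)$ is $\hat{y} = x_q^\top \Gamma \, X^\top y / n$. The dimensions, learning rate, and noiseless targets below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, num_tasks = 4, 8, 64        # illustrative sizes
lr, steps = 0.01, 5000                # illustrative optimization settings

# Finite pool of pretraining regression tasks w ~ N(0, I_d).
tasks = rng.normal(size=(num_tasks, d))

# Single-layer linear attention collapsed to one d x d parameter matrix Gamma:
# the prediction for a query x_q given context (X, y) is x_q @ Gamma @ (X.T @ y) / n_ctx.
Gamma = np.zeros((d, d))

for _ in range(steps):
    w = tasks[rng.integers(num_tasks)]  # sample one pretraining task
    X = rng.normal(size=(n_ctx, d))     # in-context inputs
    y = X @ w                           # noiseless targets, for simplicity
    x_q = rng.normal(size=d)            # query input
    y_q = x_q @ w                       # target for the query

    h = X.T @ y / n_ctx                 # context statistic aggregated by linear attention
    y_hat = x_q @ Gamma @ h
    # Gradient of 0.5 * (y_hat - y_q)^2 with respect to Gamma is (y_hat - y_q) * outer(x_q, h).
    Gamma -= lr * (y_hat - y_q) * np.outer(x_q, h)

print("trained Gamma (roughly proportional to the identity for isotropic inputs):")
print(np.round(Gamma, 2))
```

Under these assumptions the learned $\Gamma$ ends up close to a scaled identity, so the reduced model effectively performs one preconditioned least-squares step on the context.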
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all of the above) and is not responsible for any consequences arising from its use.