What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
- URL: http://arxiv.org/abs/2208.01066v3
- Date: Fri, 11 Aug 2023 19:27:58 GMT
- Title: What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
- Authors: Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant
- Abstract summary: In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples.
We show that standard Transformers can be trained from scratch to perform in-context learning of linear functions.
We also show that we can train Transformers to in-context learn more complex function classes with performance that matches or exceeds task-specific learning algorithms.
- Score: 67.06980111346245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In-context learning refers to the ability of a model to condition on a prompt
sequence consisting of in-context examples (input-output pairs corresponding to
some task) along with a new query input, and generate the corresponding output.
Crucially, in-context learning happens only at inference time without any
parameter updates to the model. While large language models such as GPT-3
exhibit some ability to perform in-context learning, it is unclear what the
relationship is between tasks on which this succeeds and what is present in the
training data. To make progress towards understanding in-context learning, we
consider the well-defined problem of training a model to in-context learn a
function class (e.g., linear functions): that is, given data derived from some
functions in the class, can we train a model to in-context learn "most"
functions from this class? We show empirically that standard Transformers can
be trained from scratch to perform in-context learning of linear functions --
that is, the trained model is able to learn unseen linear functions from
in-context examples with performance comparable to the optimal least squares
estimator. In fact, in-context learning is possible even under two forms of
distribution shift: (i) between the training data of the model and
inference-time prompts, and (ii) between the in-context examples and the query
input during inference. We also show that we can train Transformers to
in-context learn more complex function classes -- namely sparse linear
functions, two-layer neural networks, and decision trees -- with performance
that matches or exceeds task-specific learning algorithms. Our code and models
are available at https://github.com/dtsip/in-context-learning .
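To make the setup above concrete, the following minimal NumPy sketch (not the authors' code; their PyTorch implementation is in the linked repository) shows how a prompt of in-context examples for a random linear function can be generated, and how the least-squares baseline used in the comparison is computed. The prompt encoding, the choice of d and k, and the build_prompt helper are illustrative assumptions; a Transformer trained as described in the abstract would consume the prompt and produce its own prediction for the query.

```python
import numpy as np

def sample_linear_task(d, k, rng):
    """Sample one linear-regression task: weights w ~ N(0, I_d),
    k in-context examples (x_i, y_i) with y_i = <w, x_i>, and a query x."""
    w = rng.standard_normal(d)
    xs = rng.standard_normal((k, d))
    ys = xs @ w
    x_query = rng.standard_normal(d)
    y_query = x_query @ w
    return xs, ys, x_query, y_query

def build_prompt(xs, ys):
    """Interleave inputs and outputs into one sequence (x_1, y_1, ..., x_k, y_k).
    Scalar outputs are zero-padded to dimension d so every token has the same
    width -- one possible encoding; the paper's exact tokenization may differ."""
    k, d = xs.shape
    y_tokens = np.zeros((k, d))
    y_tokens[:, 0] = ys
    prompt = np.empty((2 * k, d))
    prompt[0::2] = xs
    prompt[1::2] = y_tokens
    return prompt

def least_squares_prediction(xs, ys, x_query):
    """Baseline from the paper's comparison: fit the least-squares solution
    on the in-context examples and predict the query output."""
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return x_query @ w_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 20, 40  # input dimension and number of in-context examples (assumed values)
    xs, ys, x_query, y_query = sample_linear_task(d, k, rng)
    prompt = build_prompt(xs, ys)  # would be fed to the trained Transformer

    y_ls = least_squares_prediction(xs, ys, x_query)
    # A model trained as in the paper would produce its prediction from
    # (prompt, x_query); here we only report the baseline.
    print(f"true y = {y_query:.3f}, least-squares prediction = {y_ls:.3f}")
```

In the noiseless case with at least d in-context examples, least squares recovers the weight vector exactly; the abstract's claim is that a Transformer trained from scratch reaches comparable error on unseen linear functions from the same in-context examples.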
Related papers
- In-context Learning in Presence of Spurious Correlations [8.055478206164105]
We study the possibility of training an in-context learner for classification tasks involving spurious features.
We find that the conventional approach of training in-context learners is susceptible to spurious features.
We propose a novel technique to train such a learner for a given classification task.
arXiv Detail & Related papers (2024-10-04T04:26:36Z) - In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to the capability of pretrained large language models to learn a new task from a few examples provided during inference.
This paper investigates the training dynamics of transformers trained by gradient descent, through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z) - Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting [15.69952375347308]
Language models have the ability to perform in-context learning (ICL), allowing them to flexibly adapt their behavior based on context.
We study structural in-context algorithms in a simple part-of-speech setting using both practical and toy models.
We find that active forgetting, a technique that was recently introduced to help models generalize to new languages, forces models to adopt structural in-context learning solutions.
arXiv Detail & Related papers (2024-05-28T21:38:20Z) - Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models [9.340409961107955]
Transformer models have the remarkable ability to perform in-context learning (ICL).
We study how effectively transformers can bridge between the distinct task families in their pretraining data mixture to identify and learn new tasks in-context (see the sketch after this list).
Our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than to inductive biases.
arXiv Detail & Related papers (2023-11-01T21:41:08Z) - Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method in which the transformer predicts an optimal action given a query state and an in-context dataset of interactions.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z) - MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models [74.89629463600978]
In the vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z) - Concept-aware Training Improves In-context Learning Ability of Language Models [0.0]
Many recent language models (LMs) of the Transformer family exhibit the so-called in-context learning (ICL) ability.
We propose a method to create LMs able to better utilize the in-context information.
We find that the data sampling of Concept-aware Training consistently improves models' reasoning ability.
arXiv Detail & Related papers (2023-05-23T07:44:52Z) - An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution in the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z) - Learning to Match Jobs with Resumes from Sparse Interaction Data using Multi-View Co-Teaching Network [83.64416937454801]
Job-resume interaction data is sparse and noisy, which affects the performance of job-resume match algorithms.
We propose a novel multi-view co-teaching network from sparse interaction data for job-resume matching.
Our model is able to outperform state-of-the-art methods for job-resume matching.
arXiv Detail & Related papers (2020-09-25T03:09:54Z) - From Learning to Meta-Learning: Reduced Training Overhead and Complexity for Communication Systems [40.427909614453526]
Machine learning methods adapt the parameters of a model, constrained to lie in a given model class, by using a fixed learning procedure based on data or active observations.
With a meta-trained inductive bias, training of a machine learning model can potentially be carried out with reduced training data and/or time complexity.
This paper provides a high-level introduction to meta-learning with applications to communication systems.
arXiv Detail & Related papers (2020-01-05T12:54:41Z)
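Relatedly, the pretraining-data-mixtures entry above asks how coverage of the pretraining mixture shapes which tasks can be learned in-context. The sketch below is a hypothetical illustration in the same NumPy style as the earlier example, not code from any listed paper: the task families, mixture weights, and function names are assumptions, intended only to show how a pretraining stream drawn from a weighted mixture of function classes could be assembled and later probed with in-mixture versus out-of-mixture evaluation tasks.

```python
import numpy as np

rng = np.random.default_rng(1)

def dense_linear_task(d, k):
    """y = <w, x> with a dense Gaussian weight vector."""
    w = rng.standard_normal(d)
    xs = rng.standard_normal((k, d))
    return xs, xs @ w

def sparse_linear_task(d, k, s=3):
    """y = <w, x> with only s non-zero coordinates in w."""
    w = np.zeros(d)
    idx = rng.choice(d, size=s, replace=False)
    w[idx] = rng.standard_normal(s)
    xs = rng.standard_normal((k, d))
    return xs, xs @ w

# Hypothetical pretraining mixture: which task families the model sees and
# how often. Its coverage is what the entry above ties ICL ability to.
MIXTURE = [(dense_linear_task, 0.7), (sparse_linear_task, 0.3)]

def sample_pretraining_task(d=20, k=40):
    """Draw one task family according to the mixture weights and generate
    a prompt's worth of (x, y) pairs from it."""
    fns, weights = zip(*MIXTURE)
    fn = fns[rng.choice(len(fns), p=np.array(weights))]
    return fn(d, k)

if __name__ == "__main__":
    xs, ys = sample_pretraining_task()
    print(xs.shape, ys.shape)  # (40, 20) (40,) -- one pretraining prompt
    # Evaluation would then compare ICL on tasks drawn from MIXTURE
    # (in-distribution) against a held-out family outside the mixture,
    # to measure the effect of pretraining coverage.
```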