What learning algorithm is in-context learning? Investigations with
linear models
- URL: http://arxiv.org/abs/2211.15661v3
- Date: Wed, 17 May 2023 21:08:32 GMT
- Title: What learning algorithm is in-context learning? Investigations with
linear models
- Authors: Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, Denny Zhou
- Abstract summary: We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly.
We show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression.
We present preliminary evidence that in-context learners share algorithmic features with these predictors.
- Score: 87.91612418166464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural sequence models, especially transformers, exhibit a remarkable
capacity for in-context learning. They can construct new predictors from
sequences of labeled examples $(x, f(x))$ presented in the input without
further parameter updates. We investigate the hypothesis that transformer-based
in-context learners implement standard learning algorithms implicitly, by
encoding smaller models in their activations, and updating these implicit
models as new examples appear in the context. Using linear regression as a
prototypical problem, we offer three sources of evidence for this hypothesis.
First, we prove by construction that transformers can implement learning
algorithms for linear models based on gradient descent and closed-form ridge
regression. Second, we show that trained in-context learners closely match the
predictors computed by gradient descent, ridge regression, and exact
least-squares regression, transitioning between different predictors as
transformer depth and dataset noise vary, and converging to Bayesian estimators
for large widths and depths. Third, we present preliminary evidence that
in-context learners share algorithmic features with these predictors: learners'
late layers non-linearly encode weight vectors and moment matrices. These
results suggest that in-context learning is understandable in algorithmic
terms, and that (at least in the linear case) learners may rediscover standard
estimation algorithms. Code and reference implementations are released at
https://github.com/ekinakyurek/google-research/blob/master/incontext.
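The comparisons described in the abstract are between a trained in-context learner's output and a handful of textbook predictors evaluated on the same prompt. The following minimal sketch (not taken from the released repository; names and hyperparameters are illustrative) shows how those reference predictors can be computed for a single synthetic linear-regression prompt:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 8, 16, 0.1           # input dim, number of in-context examples, label noise

# One "prompt": labeled examples (x_i, f(x_i)) from a random linear function, plus a query.
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + noise * rng.normal(size=n)
x_query = rng.normal(size=d)

def ridge(X, y, lam):
    """Closed-form ridge regression; lam -> 0 recovers exact least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def gradient_descent(X, y, steps, lr):
    """Full-batch gradient descent on the squared loss, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

# Reference predictors; with a N(0, I) prior on w and noise variance noise**2,
# ridge with lam = noise**2 is the Bayes-optimal (posterior-mean) predictor.
predictors = {
    "ridge / Bayes": ridge(X, y, lam=noise**2),
    "least squares": ridge(X, y, lam=1e-8),
    "GD, 5 steps": gradient_descent(X, y, steps=5, lr=0.1),
}
for name, w_hat in predictors.items():
    print(f"{name:14s} prediction at the query: {w_hat @ x_query:+.4f}")
```

Roughly speaking, the paper's experiments compute a trained transformer's in-context prediction at the same query and measure how close it is to each of these reference predictions as depth, width, and dataset noise vary.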
Related papers
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics follow a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z)
- Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning [23.677503557659705]
In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of examples and performs inference on-the-fly.
We treat the transformer model as a learning algorithm that can be specialized via training to implement, at inference time, another target algorithm.
We show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.
arXiv Detail & Related papers (2023-01-17T18:31:12Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass (see the sketch after this list).
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws [24.356906682593532]
We study the compute-optimal trade-off between model and training data set sizes for large neural networks.
Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla.
arXiv Detail & Related papers (2022-12-02T18:46:41Z)
- Granger Causality using Neural Networks [8.835231777363399]
We present several new classes of models that can handle underlying non-linearity.
We show one can directly decouple lags and individual time series importance via decoupled penalties.
arXiv Detail & Related papers (2022-08-07T12:02:48Z)
- Simple Stochastic and Online Gradient Descent Algorithms for Pairwise Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
- Merging Two Cultures: Deep and Statistical Learning [3.15863303008255]
Merging the two cultures of deep and statistical learning provides insights into structured high-dimensional data.
We show that prediction, optimisation and uncertainty can be achieved using probabilistic methods at the output layer of the model.
arXiv Detail & Related papers (2021-10-22T02:57:21Z)
- A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
- Deep Learning for Quantile Regression under Right Censoring: DeepQuantreg [1.0152838128195467]
This paper presents a novel application of neural networks to quantile regression for survival data with right censoring.
The main purpose of this work is to show that the deep learning method could be flexible enough to predict nonlinear patterns more accurately compared to existing quantile regression methods.
arXiv Detail & Related papers (2020-07-14T14:31:11Z)
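The mesa-optimizer claim above, and the constructive result in the main abstract (transformers can implement learning algorithms for linear models based on gradient descent), can be illustrated with a single linear self-attention operation whose hand-chosen projections make the query token's output equal a one-step gradient-descent prediction. This is a minimal sketch under simplifying assumptions (one softmax-free attention layer, weights set by hand rather than learned, illustrative variable names), not the exact construction from either paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lr = 4, 32, 0.5

# In-context examples (x_i, y_i) from a noiseless linear function, plus a query x_q.
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)

# Tokens: each example is [x_i ; y_i]; the query token is [x_q ; 0].
E = np.hstack([X, y[:, None]])          # shape (n, d+1)
e_q = np.concatenate([x_q, [0.0]])      # shape (d+1,)

# Hand-chosen projections: keys and queries read only the x-part of a token,
# values read only the y-part, scaled by the learning rate over n.
W_K = np.zeros((d + 1, d + 1)); W_K[:d, :d] = np.eye(d)
W_Q = W_K.copy()
W_V = np.zeros((d + 1, d + 1)); W_V[d, d] = lr / n

# Linear (softmax-free) attention evaluated at the query token.
attn_out = np.zeros(d + 1)
for i in range(n):
    score = (W_K @ E[i]) @ (W_Q @ e_q)   # = x_i . x_q
    attn_out += score * (W_V @ E[i])     # accumulates (lr/n) * y_i * (x_i . x_q)
pred_attention = attn_out[d]             # read the y-channel of the query token

# One step of gradient descent from w = 0 on the squared loss (1/2n)||Xw - y||^2.
w_gd = (lr / n) * (X.T @ y)
pred_gd = w_gd @ x_q

print(pred_attention, pred_gd)           # the two predictions agree
```

Stacking more such layers corresponds, roughly, to taking more gradient steps, which is one sense in which deeper in-context learners could track predictors closer to the ridge or least-squares solutions.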