Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning
- URL: http://arxiv.org/abs/2301.07067v1
- Date: Tue, 17 Jan 2023 18:31:12 GMT
- Title: Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning
- Authors: Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak
- Abstract summary: In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of examples and performs inference on-the-fly.
We treat the transformer model as a learning algorithm that can be specialized via training to implement, at inference time, another target algorithm.
We show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.
- Score: 23.677503557659705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) is a type of prompting where a transformer model
operates on a sequence of (input, output) examples and performs inference
on-the-fly. This implicit training is in contrast to explicitly tuning the
model weights based on examples. In this work, we formalize in-context learning
as an algorithm learning problem, treating the transformer model as a learning
algorithm that can be specialized via training to implement, at inference
time, another target algorithm. We first explore the statistical
aspects of this abstraction through the lens of multitask learning: We obtain
generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d.
(input, label) pairs or (2) a trajectory arising from a dynamical system. The
crux of our analysis is relating the excess risk to the stability of the
algorithm implemented by the transformer, which holds under mild assumptions.
Secondly, we use our abstraction to show that transformers can act as an
adaptive learning algorithm and perform model selection across different
hypothesis classes. We provide numerical evaluations that (1) demonstrate
transformers can indeed implement near-optimal algorithms on classical
regression problems with i.i.d. and dynamic data, (2) identify an inductive
bias phenomenon where the transfer risk on unseen tasks is independent of the
transformer complexity, and (3) empirically verify our theoretical predictions.
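To make the abstraction concrete: for prompt type (1), a sequence of i.i.d. (input, label) pairs from a linear task, a natural target algorithm that a trained transformer could implement at inference time is least squares on the in-context examples. The following sketch (our illustration, not the paper's construction; all variable names are hypothetical) shows that target algorithm run directly on a simulated prompt:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical ICL prompt: n (input, label) pairs drawn i.i.d. from a
# linear task y = <w, x> + noise, followed by a query input.
n, d = 32, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
x_query = rng.normal(size=d)

# Target algorithm the transformer could be specialized to implement:
# least squares on the in-context examples, then predict on the query.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = x_query @ w_hat

# With n >> d and small noise, the recovered weights are close to w_true,
# so the in-context prediction is near-optimal for this task.
```

In this framing, "ICL generalization" asks how well the trained transformer tracks such a target algorithm across tasks, which is what the paper's stability-based excess-risk bounds quantify.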
Related papers
- Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
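The "linear attention as a linear predictor" viewpoint can be illustrated in a few lines (our sketch with hypothetical names, not that paper's construction): with keys set to the prompt inputs and values to the labels, a single linear-attention head (no softmax) computes a prediction that is linear in the query.

```python
import numpy as np

rng = np.random.default_rng(1)

# In-context examples (x_j, y_j) from a noiseless linear task, plus a query.
n, d = 200, 4
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_query = rng.normal(size=d)

# Single linear-attention head: score_j = <query, key_j>, output = sum_j
# score_j * value_j / n, with keys = inputs and values = labels.
scores = X @ x_query
pred_attention = float(scores @ y / n)

# The same computation rearranged: x_query^T w_attn with w_attn = X^T y / n,
# i.e. the head acts as a linear predictor in the query.
w_attn = X.T @ y / n
pred_linear = float(x_query @ w_attn)

assert abs(pred_attention - pred_linear) < 1e-9
```

The rearrangement is exact; the RKHS view in the paper generalizes this by treating the attention output as a linear predictor in a suitable feature space.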
arXiv Detail & Related papers (2024-10-14T02:41:01Z) - Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z) - Understanding the Training and Generalization of Pretrained Transformer for Sequential Decision Making [7.8816327398541635]
We consider the supervised pre-trained transformer for a class of sequential decision-making problems.
Such a structure enables the use of optimal actions/decisions in the pre-training phase.
arXiv Detail & Related papers (2024-05-23T06:28:44Z) - Uncovering Intermediate Variables in Transformers using Circuit Probing [32.382094867951224]
We propose a new analysis technique -- circuit probing -- that automatically uncovers low-level circuits that compute hypothesized intermediate variables.
We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training.
arXiv Detail & Related papers (2023-11-07T21:27:17Z) - Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining [25.669038513039357]
This paper provides a theoretical framework that analyzes supervised pretraining for in-context reinforcement learning.
We show transformers with ReLU attention can efficiently approximate near-optimal online reinforcement learning algorithms.
arXiv Detail & Related papers (2023-10-12T17:55:02Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions [32.59746882017483]
We show that Transformers can learn to implement two distinct algorithms to solve a single task.
We also show that extant Large Language Models (LLMs) can compete with nearest-neighbor baselines on prediction tasks.
arXiv Detail & Related papers (2023-10-04T17:57:33Z) - Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z) - Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z) - Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
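The "algorithm selection" theme shared by the main paper and the last entry can be mimicked outside a transformer. A minimal sketch (our illustration; the function and variable names are hypothetical): given in-context examples, select between two hypothesis classes, here ridge regression over linear vs. quadratic features, by held-out error on part of the prompt, then predict with the winner.

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_fit_predict(phi_train, y_train, phi_test, lam=1e-3):
    """Closed-form ridge regression: fit on train features, predict on test."""
    d = phi_train.shape[1]
    w = np.linalg.solve(phi_train.T @ phi_train + lam * np.eye(d),
                        phi_train.T @ y_train)
    return phi_test @ w

# In-context examples from a quadratic task y = x^2.
x = rng.uniform(-1, 1, size=60)
y = x ** 2

# Two candidate hypothesis classes, as feature maps.
feature_maps = {
    "linear":    lambda x: np.stack([np.ones_like(x), x], axis=1),
    "quadratic": lambda x: np.stack([np.ones_like(x), x, x ** 2], axis=1),
}

# Implicit model selection: pick the class with the lowest held-out error
# over a split of the in-context examples.
train, val = slice(0, 40), slice(40, 60)
errors = {}
for name, phi in feature_maps.items():
    pred = ridge_fit_predict(phi(x[train]), y[train], phi(x[val]))
    errors[name] = float(np.mean((pred - y[val]) ** 2))

best = min(errors, key=errors.get)
```

A transformer trained across tasks from both classes would perform this selection implicitly in its forward pass; the sketch only makes the selection criterion explicit.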
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.