Trained Transformers Learn Linear Models In-Context
- URL: http://arxiv.org/abs/2306.09927v3
- Date: Thu, 19 Oct 2023 20:31:32 GMT
- Title: Trained Transformers Learn Linear Models In-Context
- Authors: Ruiqi Zhang, Spencer Frei, Peter L. Bartlett
- Abstract summary: Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL).
We show that when transformers are trained over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares.
- Score: 39.56636898650966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based neural networks such as transformers have demonstrated a
remarkable ability to exhibit in-context learning (ICL): Given a short prompt
sequence of tokens from an unseen task, they can formulate relevant per-token
and next-token predictions without any parameter updates. By embedding a
sequence of labeled training data and unlabeled test data as a prompt, this
allows for transformers to behave like supervised learning algorithms. Indeed,
recent work has shown that when training transformer architectures over random
instances of linear regression problems, these models' predictions mimic those
of ordinary least squares.
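As a concrete illustration of this setup, the sketch below builds one linear-regression ICL prompt (labeled pairs followed by an unlabeled query) and computes the ordinary least squares prediction that trained transformers are reported to mimic. This is a minimal sketch, not code from the paper; the helper names `make_prompt` and `ols_predict` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_prompt(n_examples: int, dim: int):
    """Sample one linear-regression ICL task: labeled pairs plus a query.

    Each prompt carries its own weight vector w, so a model must infer w
    from the in-context examples alone, with no parameter updates.
    """
    w = rng.normal(size=dim)                # task-specific weights
    X = rng.normal(size=(n_examples, dim))  # labeled covariates
    y = X @ w                               # noiseless labels
    x_query = rng.normal(size=dim)          # unlabeled test covariate
    return X, y, x_query, w

def ols_predict(X, y, x_query):
    """Ordinary least squares baseline fit on the in-context examples."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x_query @ w_hat

X, y, x_query, w_true = make_prompt(n_examples=20, dim=5)
print("OLS prediction:", ols_predict(X, y, x_query))
print("True label:    ", x_query @ w_true)
```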
Towards understanding the mechanisms underlying this phenomenon, we
investigate the dynamics of ICL in transformers with a single linear
self-attention layer trained by gradient flow on linear regression tasks. We
show that despite non-convexity, gradient flow with a suitable random
initialization finds a global minimum of the objective function. At this global
minimum, when given a test prompt of labeled examples from a new prediction
task, the transformer achieves prediction error competitive with the best
linear predictor over the test prompt distribution. We additionally
characterize the robustness of the trained transformer to a variety of
distribution shifts and show that although a number of shifts are tolerated,
shifts in the covariate distribution of the prompts are not. Motivated by this,
we consider a generalized ICL setting where the covariate distributions can
vary across prompts. We show that although gradient flow succeeds at finding a
global minimum in this setting, the trained transformer is still brittle under
mild covariate shifts. We complement this finding with experiments on large,
nonlinear transformer architectures which we show are more robust under
covariate shifts.
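To make the single-layer object of study concrete, here is a minimal sketch of a linear self-attention predictor in the reduced form common in ICL analyses: the query is applied to a preconditioned empirical correlation of the in-context examples. The matrix `Gamma` stands in for the trained layer weights; the value chosen below (the inverse sample covariance) is an illustrative assumption, not the value that gradient flow converges to in the paper.

```python
import numpy as np

def lsa_predict(X, y, x_query, Gamma):
    """Reduced-form prediction of a single linear self-attention layer:
    the query applied to a preconditioned empirical correlation.
    Gamma (dim x dim) plays the role of the trained layer weights."""
    n = X.shape[0]
    w_hat = Gamma @ (X.T @ y) / n  # preconditioned (1/n) * sum_i y_i x_i
    return x_query @ w_hat

rng = np.random.default_rng(1)
dim, n = 5, 200
w_true = rng.normal(size=dim)
X = rng.normal(size=(n, dim))   # in-context covariates
y = X @ w_true                  # noiseless in-context labels
x_query = rng.normal(size=dim)

# Illustrative choice: Gamma equal to the inverse sample covariance makes
# this predictor coincide with ordinary least squares on the prompt.
Gamma = np.linalg.inv(X.T @ X / n)
print("LSA prediction:", lsa_predict(X, y, x_query, Gamma))
print("True label:    ", x_query @ w_true)
```

One way to read the abstract's brittleness finding in this reduced form: if `Gamma` is adapted to the training covariate distribution, a shift in the test covariates leaves the preconditioner mismatched, degrading the prediction.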
Related papers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of gradient flow on the regression loss despite the non-convexity of the loss landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context [25.360386832940875]
We show that when linear transformers are pre-trained on random instances for linear regression tasks, they make predictions using an algorithm similar to that of ordinary least squares.
In some settings, these trained transformers can exhibit "benign overfitting in-context".
arXiv Detail & Related papers (2024-10-02T17:30:21Z)
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation perform well for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several works suggest that transformers learn a mesa-optimizer during autoregressive training.
We show that a stronger assumption on the moments of the data is the sufficient and necessary condition for the learned mesa-optimizer to perform well.
arXiv Detail & Related papers (2024-05-27T05:41:06Z)
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)