Transformers learn to implement preconditioned gradient descent for in-context learning
- URL: http://arxiv.org/abs/2306.00297v2
- Date: Thu, 9 Nov 2023 21:46:18 GMT
- Title: Transformers learn to implement preconditioned gradient descent for in-context learning
- Authors: Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, Suvrit Sra
- Abstract summary: Several recent works demonstrate that transformers can implement algorithms like gradient descent.
We ask: Can transformers learn to implement such algorithms by training over random problem instances?
For a transformer with $L$ attention layers, we prove certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent.
- Score: 41.74394657009037
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Several recent works demonstrate that transformers can implement algorithms
like gradient descent. By a careful construction of weights, these works show
that multiple layers of transformers are expressive enough to simulate
iterations of gradient descent. Going beyond the question of expressivity, we
ask: Can transformers learn to implement such algorithms by training over
random problem instances? To our knowledge, we make the first theoretical
progress on this question via an analysis of the loss landscape for linear
transformers trained over random instances of linear regression. For a single
attention layer, we prove the global minimum of the training objective
implements a single iteration of preconditioned gradient descent. Notably, the
preconditioning matrix not only adapts to the input distribution but also to
the variance induced by data inadequacy. For a transformer with $L$ attention
layers, we prove certain critical points of the training objective implement
$L$ iterations of preconditioned gradient descent. Our results call for future
theoretical studies on learning algorithms by training transformers.
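To make the single-layer result concrete, here is a minimal NumPy sketch, not the authors' code: with a fixed, hand-chosen preconditioner $\Gamma$ (in the paper, the preconditioner is learned and adapts to the input distribution), a linear attention read-out over the context reproduces exactly the prediction of one preconditioned gradient step from $w_0 = 0$; stacking $L$ such layers repeats the update $L$ times.
```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20                     # feature dimension, context length

# One in-context linear regression instance: context pairs (x_i, y_i), query x_q.
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
x_q = rng.normal(size=d)

# One preconditioned gradient step from w0 = 0 on 0.5 * ||X w - y||^2:
#   w1 = w0 + Gamma @ X.T @ (y - X @ w0) = Gamma @ X.T @ y.
Gamma = np.linalg.inv(X.T @ X)   # illustrative preconditioner, chosen by hand
w1 = Gamma @ X.T @ y
pred_gd = x_q @ w1

# Linear attention on tokens z_i = (x_i, y_i) with query token (x_q, 0):
# with query/key maps reading the x-part (through Gamma) and value maps
# writing y_i, the label slot of the query token accumulates
#   sum_i (x_q . Gamma x_i) * y_i = x_q @ Gamma @ X.T @ y,
# which is exactly the one-step prediction above.
scores = (x_q @ Gamma) @ X.T     # un-normalized linear attention scores
pred_attn = scores @ y

assert np.isclose(pred_gd, pred_attn)
print(pred_gd, pred_attn)
```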
Related papers
- One-Layer Transformer Provably Learns One-Nearest Neighbor In Context [48.4979348643494]
We study the capability of one-layer transformers to learn the one-nearest neighbor rule.
A single softmax attention layer can successfully learn to behave like a one-nearest neighbor classifier.
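As a rough illustration (a minimal sketch, not the paper's construction), softmax attention whose scores are scaled negative squared distances to the query collapses to the one-nearest-neighbor rule as the score scale grows:
```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 10
X = rng.normal(size=(n, d))      # context inputs
y = rng.normal(size=n)           # context labels
x_q = rng.normal(size=d)         # query input

def softmax_attention_predict(beta):
    # Scores favor context points close to the query; a large scale beta
    # makes the softmax concentrate almost all weight on the closest one.
    scores = -beta * np.sum((X - x_q) ** 2, axis=1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ y

nn_label = y[np.argmin(np.sum((X - x_q) ** 2, axis=1))]   # exact 1-NN answer
print(softmax_attention_predict(beta=100.0), nn_label)    # nearly identical
```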
arXiv Detail & Related papers (2024-11-16T16:12:42Z)
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of gradient flow on the regression loss despite the non-convexity of the training landscape.
This is the first theoretical analysis of multi-layer Transformers in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- How Well Can Transformers Emulate In-context Newton's Method? [46.08521978754298]
We study whether Transformers can perform higher-order optimization methods, beyond the case of linear regression.
We demonstrate that even linear attention-only Transformers can implement a single step of Newton's iteration for matrix inversion with merely two layers.
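For reference, the Newton iteration for matrix inversion that these constructions emulate is the quadratically convergent update $X_{k+1} = X_k(2I - AX_k)$; a minimal numerical sketch (the plain iteration, not the two-layer attention construction) is:
```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d)) + d * np.eye(d)    # well-conditioned test matrix

# Newton's iteration for A^{-1}: X_{k+1} = X_k (2I - A X_k).
# The standard initialization X0 = A.T / (||A||_1 ||A||_inf) guarantees convergence.
X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
for _ in range(20):
    X = X @ (2 * np.eye(d) - A @ X)

print(np.max(np.abs(X @ A - np.eye(d))))       # ~0: X approximates A^{-1}
```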
arXiv Detail & Related papers (2024-03-05T18:20:10Z)
- How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning the sequence-generating matrix $W$ in-context and then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)