One Step of Gradient Descent is Provably the Optimal In-Context Learner
with One Layer of Linear Self-Attention
- URL: http://arxiv.org/abs/2307.03576v1
- Date: Fri, 7 Jul 2023 13:09:18 GMT
- Title: One Step of Gradient Descent is Provably the Optimal In-Context Learner
with One Layer of Linear Self-Attention
- Authors: Arvind Mahankali, Tatsunori B. Hashimoto, Tengyu Ma
- Abstract summary: Recent works have empirically analyzed in-context learning.
One-layer transformers with linear self-attention learn to implement one step of gradient descent.
- Score: 31.522320487765878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have empirically analyzed in-context learning and shown that
transformers trained on synthetic linear regression tasks can learn to
implement ridge regression, which is the Bayes-optimal predictor, given
sufficient capacity [Aky\"urek et al., 2023], while one-layer transformers with
linear self-attention and no MLP layer will learn to implement one step of
gradient descent (GD) on a least-squares linear regression objective [von
Oswald et al., 2022]. However, the theory behind these observations remains
poorly understood. We theoretically study transformers with a single layer of
linear self-attention, trained on synthetic noisy linear regression data.
First, we mathematically show that when the covariates are drawn from a
standard Gaussian distribution, the one-layer transformer which minimizes the
pre-training loss will implement a single step of GD on the least-squares
linear regression objective. Then, we find that changing the distribution of
the covariates and weight vector to a non-isotropic Gaussian distribution has a
strong impact on the learned algorithm: the global minimizer of the
pre-training loss now implements a single step of $\textit{pre-conditioned}$
GD. However, if only the distribution of the responses is changed, then this
does not have a large effect on the learned algorithm: even when the response
comes from a more general family of $\textit{nonlinear}$ functions, the global
minimizer of the pre-training loss still implements a single step of GD on a
least-squares linear regression objective.
Related papers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the gradient non-ity algorithms for our convergence landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several suggest that transformers learn a mesa-optimizer during autorere training.
We show that a stronger assumption related to the moments of data is the sufficient necessary condition that the learned mesa-optimizer can perform.
arXiv Detail & Related papers (2024-05-27T05:41:06Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Understanding Incremental Learning of Gradient Descent: A Fine-grained
Analysis of Matrix Sensing [74.2952487120137]
It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in machine learning models.
This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem.
arXiv Detail & Related papers (2023-01-27T02:30:51Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning makes significant progress in pre-training large models, but struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Scale-free Unconstrained Online Learning for Curved Losses [1.5147172044848798]
We investigate the possibility of adapting simultaneously to the norm $U$ of the comparator and the maximum norm $G$ of the gradients.
Surprisingly, recent results show that no such price for adaptivity is needed in the specific case of $1$-Lipschitz losses.
arXiv Detail & Related papers (2022-02-11T14:10:35Z) - Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity
Bias [34.81794649454105]
Real-life neural networks are from small random values and trained with cross-entropy loss for classification.
Recent results show that gradient descent converges to the "max-margin" solution with zero loss, which presumably generalizes well.
The current paper is able to establish this global optimality for two-layer ReLU nets trained with gradient flow on linearly separable and symmetric data.
arXiv Detail & Related papers (2021-10-26T17:57:57Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Fast OSCAR and OWL Regression via Safe Screening Rules [97.28167655721766]
Ordered $L_1$ (OWL) regularized regression is a new regression analysis for high-dimensional sparse learning.
Proximal gradient methods are used as standard approaches to solve OWL regression.
We propose the first safe screening rule for OWL regression by exploring the order of the primal solution with the unknown order structure.
arXiv Detail & Related papers (2020-06-29T23:35:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.