Related papers: One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

URL: http://arxiv.org/abs/2307.03576v1
Date: Fri, 7 Jul 2023 13:09:18 GMT
Title: One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention
Authors: Arvind Mahankali, Tatsunori B. Hashimoto, Tengyu Ma
Abstract summary: Recent works have empirically analyzed in-context learning. One-layer transformers with linear self-attention learn to implement one step of gradient descent.
Score: 31.522320487765878
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, which is the Bayes-optimal predictor, given sufficient capacity [Aky\"urek et al., 2023], while one-layer transformers with linear self-attention and no MLP layer will learn to implement one step of gradient descent (GD) on a least-squares linear regression objective [von Oswald et al., 2022]. However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we mathematically show that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer which minimizes the pre-training loss will implement a single step of GD on the least-squares linear regression objective. Then, we find that changing the distribution of the covariates and weight vector to a non-isotropic Gaussian distribution has a strong impact on the learned algorithm: the global minimizer of the pre-training loss now implements a single step of $\textit{pre-conditioned}$ GD. However, if only the distribution of the responses is changed, then this does not have a large effect on the learned algorithm: even when the response comes from a more general family of $\textit{nonlinear}$ functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.

Related papers

A Simplified Analysis of SGD for Linear Regression with Weight Averaging [64.2393952273612]
Recent work bycitetzou 2021benign provides sharp rates for SGD optimization in linear regression using constant learning rate.<n>We provide a simplified analysis recovering the same bias and variance bounds provided incitepzou 2021benign based on simple linear algebra tools.<n>We believe our work makes the analysis of gradient descent on linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling.
arXiv Detail & Related papers (2025-06-18T15:10:38Z)
Learning Curves of Stochastic Gradient Descent in Kernel Regression [7.063108005500741]
We analyze the single-pass Gradient Descent (SGD) in kernel regression under source condition.<n>Surprisingly, we show that SGD achieves min-max optimal rates up to constants among all the scales.<n>The main reason for SGD to overcome the curse of saturation is the exponentially decaying step size schedule.
arXiv Detail & Related papers (2025-05-28T07:16:11Z)
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the gradient non-ity algorithms for our convergence landscape. This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several suggest that transformers learn a mesa-optimizer during autorere training. We show that a stronger assumption related to the moments of data is the sufficient necessary condition that the learned mesa-optimizer can perform.
arXiv Detail & Related papers (2024-05-27T05:41:06Z)
Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features. We find new and interesting properties that do not exist in single-task linear regression. Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing [74.2952487120137]
It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem.
arXiv Detail & Related papers (2023-01-27T02:30:51Z)
Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning makes significant progress in pre-training large models, but struggles with small models. We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers. A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
Scale-free Unconstrained Online Learning for Curved Losses [1.5147172044848798]
We investigate the possibility of adapting simultaneously to the norm $U$ of the comparator and the maximum norm $G$ of the gradients. Surprisingly, recent results show that no such price for adaptivity is needed in the specific case of $1$-Lipschitz losses.
arXiv Detail & Related papers (2022-02-11T14:10:35Z)
Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias [34.81794649454105]
Real-life neural networks are from small random values and trained with cross-entropy loss for classification. Recent results show that gradient descent converges to the "max-margin" solution with zero loss, which presumably generalizes well. The current paper is able to establish this global optimality for two-layer ReLU nets trained with gradient flow on linearly separable and symmetric data.
arXiv Detail & Related papers (2021-10-26T17:57:57Z)
Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime. We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
Fast OSCAR and OWL Regression via Safe Screening Rules [97.28167655721766]
Ordered $L_1$ (OWL) regularized regression is a new regression analysis for high-dimensional sparse learning. Proximal gradient methods are used as standard approaches to solve OWL regression. We propose the first safe screening rule for OWL regression by exploring the order of the primal solution with the unknown order structure.
arXiv Detail & Related papers (2020-06-29T23:35:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.