How do Transformers perform In-Context Autoregressive Learning?
- URL: http://arxiv.org/abs/2402.05787v2
- Date: Wed, 5 Jun 2024 13:44:00 GMT
- Title: How do Transformers perform In-Context Autoregressive Learning?
- Authors: Michael E. Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, Gabriel Peyré
- Abstract summary: We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
- Score: 76.18489638049545
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their tremendous success are still unclear. In this paper, towards a better understanding, we train a Transformer model on a simple next token prediction task, where sequences are generated as a first-order autoregressive process $s_{t+1} = W s_t$. We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping. We call the resulting procedure in-context autoregressive learning. More precisely, focusing on commuting orthogonal matrices $W$, we first show that a trained one-layer linear Transformer implements one step of gradient descent for the minimization of an inner objective function, when considering augmented tokens. When the tokens are not augmented, we characterize the global minima of a one-layer diagonal linear multi-head Transformer. Importantly, we exhibit orthogonality between heads and show that positional encoding captures trigonometric relations in the data. On the experimental side, we consider the general case of non-commuting orthogonal matrices and generalize our theoretical findings.
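As a concrete illustration of the setup in the abstract, here is a minimal numpy sketch (not the authors' code) of the process $s_{t+1} = W s_t$ and of in-context estimation of $W$ by one gradient step from zero on an inner least-squares objective; the zero initialization and step size $\eta$ are illustrative assumptions.

```python
# Hedged sketch of the abstract's setup: sequences follow s_{t+1} = W s_t for an
# orthogonal W, and the next token is predicted by first estimating W in-context.
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 32                                   # token dimension, context length

W, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
s = [rng.standard_normal(d)]
for _ in range(T):
    s.append(W @ s[-1])                        # first-order autoregressive process
S = np.stack(s)                                # (T+1, d)

# In-context estimate of W: one gradient step from W_hat = 0 on the inner
# objective  L(W_hat) = sum_t ||s_{t+1} - W_hat s_t||^2  (step size eta assumed).
X, Y = S[:-1], S[1:]
eta = 1.0 / (2 * T)
W_hat = 2 * eta * Y.T @ X                      # -grad at 0 is 2 * sum_t s_{t+1} s_t^T

# Prediction mapping: apply the estimate to the last observed token.
pred = W_hat @ S[-1]
true = W @ S[-1]
print("cosine(pred, true) =",
      pred @ true / (np.linalg.norm(pred) * np.linalg.norm(true)))
```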
Related papers
- On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several recent works suggest that transformers learn a mesa-optimizer during autoregressive training.
We show that a stronger assumption on the moments of the data is a necessary and sufficient condition for the learned mesa-optimizer to perform well.
arXiv Detail & Related papers (2024-05-27T05:41:06Z) - Toward a Theory of Tokenization in LLMs [26.516041872337887]
We study tokenization from a theoretical point of view by analyzing the behavior of transformers on simple data-generating processes.
We show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near optimally.
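The flavor of this result can be checked numerically. The sketch below is not the paper's construction: the fixed-width block tokenizer and the transition matrix are illustrative assumptions. It fits a unigram model over tokens of width $B$ on data from a first-order Markov source and shows its per-character cross-entropy approaching the source's entropy rate as $B$ grows.

```python
# Toy numerical check: a unigram model over tokens approaches the entropy rate
# of a Markov source as tokens get longer, whereas over raw symbols (B = 1)
# it is far from optimal.
import numpy as np

rng = np.random.default_rng(4)
P = np.array([[0.9, 0.1], [0.2, 0.8]])        # first-order Markov source on {0,1}

n = 200_000
s = np.zeros(n, dtype=np.int64)
for t in range(1, n):
    s[t] = rng.random() < P[s[t - 1], 1]      # sample the chain

def unigram_bits_per_char(seq, B):
    """Plug-in per-character entropy of the empirical unigram model over B-blocks."""
    toks = seq[: len(seq) // B * B].reshape(-1, B)
    ids = toks @ (2 ** np.arange(B))          # encode each block as an integer id
    q = np.bincount(ids, minlength=2 ** B) / (len(seq) // B)
    q = q[q > 0]
    return -(q * np.log2(q)).sum() / B

pi = np.array([2 / 3, 1 / 3])                 # stationary distribution of P
print("entropy rate (optimal):", -(pi[:, None] * P * np.log2(P)).sum())
for B in (1, 2, 4, 8):
    print(f"unigram over {B}-blocks:", unigram_bits_per_char(s, B))
```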
arXiv Detail & Related papers (2024-04-12T09:01:14Z) - How Well Can Transformers Emulate In-context Newton's Method? [46.08521978754298]
We study whether Transformers can perform higher order optimization methods, beyond the case of linear regression.
We demonstrate that even linear attention-only Transformers can implement a single step of Newton's iteration for matrix inversion with merely two layers.
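For reference, the iteration in question is the classical Newton(-Schulz) update for matrix inversion, $X_{k+1} = X_k(2I - AX_k)$. The sketch below is plain numpy, not the paper's Transformer construction, and uses a standard initialization that guarantees convergence.

```python
# Newton's iteration for matrix inversion: X_{k+1} = X_k (2I - A X_k), which
# converges quadratically to A^{-1} when ||I - A X_0|| < 1. The initialization
# X_0 = A^T / (||A||_1 ||A||_inf) is a standard choice ensuring this.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
A = A @ A.T + 5 * np.eye(5)                   # a well-conditioned test matrix

X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
I = np.eye(5)
for _ in range(20):
    X = X @ (2 * I - A @ X)                   # one Newton step

print("||A X - I|| =", np.linalg.norm(A @ X - I))
```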
arXiv Detail & Related papers (2024-03-05T18:20:10Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Transformers learn to implement preconditioned gradient descent for
in-context learning [41.74394657009037]
Several recent works demonstrate that transformers can implement algorithms like gradient descent.
We ask: Can transformers learn to implement such algorithms by training over random problem instances?
For a transformer with $L$ attention layers, we prove that certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent.
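To make the statement concrete, here is what $L$ iterations of preconditioned gradient descent look like on an in-context least-squares objective. The damped inverse-covariance preconditioner below is an illustrative choice, not the one the paper derives.

```python
# Sketch of L iterations of preconditioned gradient descent on the in-context
# objective  L(w) = ||X w - y||^2 / (2n). The preconditioner P is assumed here;
# the paper characterizes the one that critical points of training implement.
import numpy as np

rng = np.random.default_rng(2)
n, d, L = 64, 8, 3
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                                # noiseless in-context regression

P = np.linalg.inv(X.T @ X / n + 0.1 * np.eye(d))  # damped preconditioner (assumed)
w = np.zeros(d)
for _ in range(L):                            # one step per attention layer
    grad = X.T @ (X @ w - y) / n
    w -= P @ grad

print("parameter error after", L, "steps:", np.linalg.norm(w - w_star))
```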
arXiv Detail & Related papers (2023-06-01T02:35:57Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
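The core identity behind this claim is short enough to check directly: one gradient-descent step from $w = 0$ on in-context linear regression predicts $\hat y = \eta \sum_i y_i \langle x_i, x_q \rangle$, which is exactly an (unnormalized) linear-attention readout. A minimal sketch, with an assumed step size $\eta$:

```python
# One GD step on  ||X w - y||^2 / 2  from w = 0 equals a linear-attention
# readout with keys x_i, values y_i, and query x_q (step size eta assumed).
import numpy as np

rng = np.random.default_rng(3)
n, d, eta = 16, 4, 0.1
X = rng.standard_normal((n, d))               # in-context inputs
y = X @ rng.standard_normal(d)                # in-context targets
x_q = rng.standard_normal(d)                  # query token

w1 = eta * X.T @ y                            # GD view: w_1 = -eta * grad at w = 0
gd_pred = w1 @ x_q

attn_pred = eta * y @ (X @ x_q)               # linear-attention view

print(np.isclose(gd_pred, attn_pred))         # True: the two views coincide
```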
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
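The compression mechanism can be sketched in a few lines: strided pooling between blocks shortens the hidden-state sequence, and since self-attention costs $O(T^2)$, each halving roughly quarters the attention FLOPs of later blocks. The pooling width and shapes below are illustrative, not the paper's exact configuration.

```python
# Sketch of the funnel idea: mean-pool hidden states between blocks so later
# (hypothetical) attention blocks operate on shorter sequences.
import numpy as np

def pool(h, stride=2):
    T, d = h.shape
    return h[: T // stride * stride].reshape(-1, stride, d).mean(axis=1)

h = np.random.default_rng(5).standard_normal((128, 16))  # (T, d) hidden states
for _ in range(3):
    # ... an attention block would process h here ...
    h = pool(h)                               # 128 -> 64 -> 32 -> 16
    print("sequence length after pooling:", h.shape[0])
```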
arXiv Detail & Related papers (2020-06-05T05:16:23Z) - Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
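The segment-aware mechanism amounts to indexing each token by its paragraph, sentence, and within-sentence position and summing three embedding tables instead of one; below is a minimal sketch with illustrative table sizes and made-up positions.

```python
# Sketch of segment-aware position encoding: sum paragraph-, sentence-, and
# token-position embeddings per token. Table size and positions are assumed.
import numpy as np

rng = np.random.default_rng(6)
d = 8
emb_para, emb_sent, emb_tok = (rng.standard_normal((32, d)) for _ in range(3))

# (paragraph, sentence, token) indices, e.g. for "Hello world. Bye."
positions = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
pos_enc = np.stack([emb_para[p] + emb_sent[s] + emb_tok[t]
                    for p, s, t in positions])
print(pos_enc.shape)                          # (3 tokens, d)
```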
arXiv Detail & Related papers (2020-04-30T17:38:27Z) - Non-Autoregressive Machine Translation with Disentangled Context Transformer [70.95181466892795]
State-of-the-art neural machine translation models generate a translation from left to right, with every step conditioned on the previously generated tokens.
We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts.
Our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.
arXiv Detail & Related papers (2020-01-15T05:32:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.