Birth of a Transformer: A Memory Viewpoint
- URL: http://arxiv.org/abs/2306.00802v2
- Date: Mon, 6 Nov 2023 22:51:51 GMT
- Title: Birth of a Transformer: A Memory Viewpoint
- Authors: Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon
Bottou
- Abstract summary: Large language models based on transformers have achieved great empirical successes.
As they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable.
We study how transformers balance these two types of distributions of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigrams.
- Score: 25.294093283819443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models based on transformers have achieved great empirical
successes. However, as they are deployed more widely, there is a growing need
to better understand their internal mechanisms in order to make them more
reliable. These models appear to store vast amounts of knowledge from their
training data, and to adapt quickly to new information provided in their
context or prompt. We study how transformers balance these two types of
knowledge by considering a synthetic setup where tokens are generated from
either global or context-specific bigram distributions. By a careful empirical
analysis of the training process on a simplified two-layer transformer, we
illustrate the fast learning of global bigrams and the slower development of an
"induction head" mechanism for the in-context bigrams. We highlight the role of
weight matrices as associative memories, provide theoretical insights on how
gradients enable their learning during training, and study the role of
data-distributional properties.
Related papers
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z) - How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression [19.64743851296488]
In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning.
We experimentally discover that the utilization of multi-heads exhibits different patterns across layers.
We demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms.
arXiv Detail & Related papers (2024-08-08T15:33:02Z) - Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization [22.033370572209744]
We study whether transformers can learn to implicitly reason over parametric knowledge.
We focus on two representative reasoning types, composition and comparison.
We find that transformers can learn implicit reasoning, but only through grokking.
arXiv Detail & Related papers (2024-05-23T21:42:19Z) - Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - Modifying Memories in Transformer Models [71.48657481835767]
We propose a new task of emphexplicitly modifying specific factual knowledge in Transformer models.
This task is useful in many scenarios, such as updating stale knowledge, protecting privacy, and eliminating unintended biases stored in the models.
arXiv Detail & Related papers (2020-12-01T09:39:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.