Understanding Factual Recall in Transformers via Associative Memories
- URL: http://arxiv.org/abs/2412.06538v1
- Date: Mon, 09 Dec 2024 14:48:14 GMT
- Title: Understanding Factual Recall in Transformers via Associative Memories
- Authors: Eshaan Nichani, Jason D. Lee, Alberto Bietti
- Abstract summary: We show that shallow transformers can use a combination of associative memories to obtain near-optimal storage capacity.
We show that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on a factual recall task.
- Score: 55.93756571457904
- Abstract: Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near optimal storage capacity. We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count. We next introduce a synthetic factual recall task, and prove that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on the task whenever either the total number of self-attention parameters or MLP parameters scales (up to log factors) linearly with the number of facts. In particular, the transformer can trade off between using the value matrices or the MLP as an associative memory to store the dataset of facts. We complement these expressivity results with an analysis of the gradient flow trajectory of a simplified linear attention model trained on our factual recall task, where we show that the model exhibits sequential learning behavior.
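To make the associative-memory picture concrete, below is a minimal NumPy sketch (not the paper's construction) of a linear associative memory: facts are stored in a single weight matrix as a sum of outer products v_i k_i^T of random key and value embeddings, and a fact is recalled by applying the matrix to its key embedding and decoding to the nearest value embedding. The embedding dimension, number of facts, and decoding rule are illustrative assumptions rather than the authors' exact setup.
```python
import numpy as np

rng = np.random.default_rng(0)

d = 512           # embedding dimension (illustrative choice)
num_facts = 2000  # number of stored (key, value) pairs (illustrative choice)

# Random embeddings; in high dimension they are nearly orthogonal unit vectors.
keys = rng.standard_normal((num_facts, d)) / np.sqrt(d)
values = rng.standard_normal((num_facts, d)) / np.sqrt(d)

# Store every fact in one d x d matrix: W = sum_i values[i] * keys[i]^T.
W = values.T @ keys

def recall(i):
    """Recall fact i: project its key through W, decode to the nearest value."""
    retrieved = W @ keys[i]        # approximately values[i] plus small crosstalk
    scores = values @ retrieved    # inner product with every stored value
    return int(np.argmax(scores))  # index of the decoded fact

accuracy = np.mean([recall(i) == i for i in range(num_facts)])
print(f"recall accuracy: {accuracy:.3f}")  # close to 1.0 well below capacity
```
Because random high-dimensional embeddings are nearly orthogonal, the crosstalk between stored facts stays small, which is the intuition behind a storage capacity that grows with the d x d parameter count of the memory matrix.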
Related papers
- Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z) - Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z) - Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers [40.964584197528175]
Large Language Models (LLMs) have the capacity to store and recall facts.
LLMs might behave like an associative memory model where certain tokens in the contexts serve as clues to retrieving facts.
arXiv Detail & Related papers (2024-06-26T14:49:54Z) - Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z) - The Closeness of In-Context Learning and Weight Shifting for Softmax Regression [42.95984289657388]
We study the in-context learning based on a softmax regression formulation.
We show that when training self-attention-only Transformers for fundamental regression tasks, the models learned by gradient-descent and Transformers show great similarity.
arXiv Detail & Related papers (2023-04-26T04:33:41Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z) - Understanding Transformer Memorization Recall Through Idioms [42.28269674547148]
We offer the first methodological framework for probing and characterizing recall of memorized sequences in language models.
We analyze the internal prediction construction process by interpreting the model's hidden representations as a gradual refinement of the output probability distribution.
Our work makes a first step towards understanding memory recall, and provides a methodological basis for future studies of transformer memorization.
arXiv Detail & Related papers (2022-10-07T14:45:31Z) - PairConnect: A Compute-Efficient MLP Alternative to Attention [31.659580535552184]
We present a memory-heavy but significantly more compute-efficient alternative to the Transformer.
Our proposal, denoted as PairConnect, models the pairwise interaction between words by explicit pairwise word embeddings.
Our experiment on language modeling suggests that PairConnect could achieve comparable results with Transformer while reducing the computational cost associated with inference significantly.
arXiv Detail & Related papers (2021-06-15T15:39:45Z) - Pay Attention to MLPs [84.54729425918164]
We show that gMLP can perform as well as Transformers in key language and vision applications.
Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy.
In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
arXiv Detail & Related papers (2021-05-17T17:55:04Z)