Memorization Capacity of Multi-Head Attention in Transformers
- URL: http://arxiv.org/abs/2306.02010v3
- Date: Sat, 2 Mar 2024 07:50:37 GMT
- Title: Memorization Capacity of Multi-Head Attention in Transformers
- Authors: Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis
- Abstract summary: This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize.
Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data.
Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property.
- Score: 41.63663596609437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have become the go-to architecture for language and vision
tasks, yet their theoretical properties, especially memorization capacity,
remain elusive. This paper investigates the memorization abilities of
multi-head attention mechanisms, examining how many example sequences they can
memorize, as a function of the number of heads and sequence length. Motivated
by experimental findings on vision transformers, we introduce novel assumptions
about the linear independence of input data, distinct from the commonly used
general-position assumption. Under these assumptions, we demonstrate that an
attention layer with $H$ heads, dimension $d$, and context size $n < d$,
featuring $\Theta(Hd^2)$ parameters, can memorize $\Omega(Hn)$ examples. Our
analysis sheds light on how different attention heads handle various example
sequences, aided by the softmax operator's saturation property. We validate our
findings through experiments on synthetic data.
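For intuition, the setting above can be sketched in a few lines of NumPy (an illustration of the setup, not the authors' code): an attention layer whose $H$ heads each carry their own $d \times d$ projections, so the parameter count scales as $\Theta(Hd^2)$, together with a small demonstration of the softmax saturation effect the analysis leverages. The shapes, the inverse temperature `beta`, and the random initialization are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, d, n = 4, 16, 8            # heads, model dimension, context size (n < d)

# Per-head projections of full width d x d: three (H, d, d) tensors plus an
# output map, i.e. Theta(H d^2) trainable parameters in total.
Wq = rng.standard_normal((H, d, d)) / np.sqrt(d)
Wk = rng.standard_normal((H, d, d)) / np.sqrt(d)
Wv = rng.standard_normal((H, d, d)) / np.sqrt(d)
Wo = rng.standard_normal((H * d, d)) / np.sqrt(H * d)

def attention_layer(X, beta=1.0):
    """One multi-head attention layer on a single sequence X of shape (n, d).
    beta is an inverse temperature; large beta saturates the softmax."""
    heads, attn = [], []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        logits = beta * (Q @ K.T) / np.sqrt(d)
        A = np.exp(logits - logits.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)          # row-stochastic attention
        attn.append(A)
        heads.append(A @ V)                         # (n, d) per head
    return np.concatenate(heads, axis=-1) @ Wo, attn

n_params = Wq.size + Wk.size + Wv.size + Wo.size
print(f"parameters: {n_params}  (Theta(H d^2); here H*d^2 = {H * d * d})")
print(f"memorization budget from the theorem: Omega(H n) = {H * n} examples")

# Softmax saturation: as beta grows, each attention row collapses toward a
# one-hot vector, so every query token effectively selects a single key token.
X = rng.standard_normal((n, d))
for beta in (1.0, 10.0, 100.0):
    _, attn = attention_layer(X, beta)
    peak = float(np.mean([A.max(axis=-1).mean() for A in attn]))
    print(f"beta={beta:6.1f}  mean max attention weight = {peak:.3f}")
```

Scaling `beta` here stands in for what training can achieve by growing the query/key weights; near one-hot attention rows are what allow different heads to dedicate themselves to different example sequences in the paper's construction.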
Related papers
- If Attention Serves as a Cognitive Model of Human Memory Retrieval, What is the Plausible Memory Representation? [3.757103053174534]
We investigate whether the attention mechanism of Transformer Grammar (TG) can serve as a cognitive model of human memory retrieval.
Our experiments demonstrate that TG's attention achieves superior predictive power for self-paced reading times compared to vanilla Transformer's.
arXiv Detail & Related papers (2025-02-17T05:58:25Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance, where $L$ is the network depth.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS (see the sketch after this list).
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads [10.169639612525643]
We propose a new multi-head self-attention (MHSA) variant named Fibottention, which can replace MHSA in Transformer architectures.
Fibottention is data-efficient and computationally more suitable for processing large numbers of tokens than the standard MHSA.
It employs structured sparse attention based on dilated Fibonacci sequences, which are unique to each attention head, resulting in inception-like diverse features across heads.
arXiv Detail & Related papers (2024-06-27T17:59:40Z) - Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or Memorizing [10.206921909332006]
Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate.
We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential (reasoning-based) or memorization-based solutions.
We further find that inferential (reasoning-based) solutions exhibit low complexity bias, which we hypothesize is a key factor enabling them to learn individual mappings for single anchors.
arXiv Detail & Related papers (2024-05-08T20:23:24Z) - What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks [15.874604623294427]
We show that a transformer with only one attention layer can excel at memorization but falls short on other tasks.
We identify a class of simple operations that a single attention layer can execute, and show that more complex tasks can be approached as combinations of these simple operations.
arXiv Detail & Related papers (2024-04-02T02:45:12Z) - Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks [32.33355192614434]
We propose an effective and efficient surrogate of the Transformer, called Sliceformer.
Our Sliceformer replaces the classic MHA mechanism with an extremely simple "slicing-sorting" operation.
Our Sliceformer achieves comparable or better performance with lower memory cost and faster speed than the Transformer and its variants.
arXiv Detail & Related papers (2023-10-26T14:43:07Z) - Leveraging redundancy in attention with Reuse Transformers [58.614198953733194]
Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way.
A typical Transformer model computes such pairwise attention scores repeatedly for the same sequence.
We propose a novel architecture that reuses attention scores computed in one layer in multiple subsequent layers.
arXiv Detail & Related papers (2021-10-13T16:08:02Z) - Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably to or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
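As a companion to the "Learning Linear Attention in Polynomial Time" entry above, the following NumPy sketch (my own parameterization, not the authors' construction) makes the linear-predictor view concrete: a single-layer linear attention map is checked numerically to coincide with a linear function, in a merged parameter tensor, of a purely data-dependent feature map, which is the sense in which it acts as a linear predictor in a suitably defined RKHS.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 6, 5
X  = rng.standard_normal((n, d))       # one input sequence of n tokens in R^d
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

# Single-layer linear attention (softmax removed): for query token i,
#   out_i = sum_j (x_i^T M x_j) * (Wv x_j),   with M = Wq @ Wk.T
M = Wq @ Wk.T
scores = X @ M @ X.T                   # (n, n) unnormalized attention scores
out = scores @ X @ Wv.T                # (n, d) attention output

# The same map is LINEAR in the merged parameter tensor
#   T[c, a, b, e] = M[a, b] * Wv[c, e]
# applied to the data-only feature map
#   phi(X)_i[a, b, e] = x_i[a] * (X^T X)[b, e].
T   = np.einsum('ab,ce->cabe', M, Wv)
phi = np.einsum('ia,be->iabe', X, X.T @ X)
out_linear = np.einsum('cabe,iabe->ic', T, phi)

print(np.allclose(out, out_linear))    # True: identical predictions
```

Because the prediction is linear in `T` while the attention parameterization merely constrains `T` to a product form, learnability questions can be attacked with kernel and linear-regression tools, which appears to be the angle that abstract takes.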