Optimal Memorization Capacity of Transformers
- URL: http://arxiv.org/abs/2409.17677v1
- Date: Thu, 26 Sep 2024 09:36:47 GMT
- Title: Optimal Memorization Capacity of Transformers
- Authors: Tokio Kajitsuka, Issei Sato
- Abstract summary: We show that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$.
We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary for Transformers with hardmax.
- Score: 32.01426831450348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length $n$ owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network becomes a bottleneck when associating a label to each token.
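As a rough sense of how the two bounds scale, the sketch below plugs hypothetical values of $N$ and $n$ into the $\tilde{O}(\sqrt{N})$ and $\tilde{O}(\sqrt{nN})$ rates, ignoring the hidden logarithmic factors; it illustrates the rates only, not the paper's construction.
```python
import math

# Illustrative comparison of the two memorization bounds from the abstract,
# ignoring logarithmic factors hidden in the tilde-O notation.
# N = number of input sequences, n = sequence length (hypothetical values).
N, n = 1_000_000, 512

next_token_params = math.sqrt(N)       # ~O(sqrt(N)): independent of n
seq_to_seq_params = math.sqrt(n * N)   # ~O(sqrt(nN)): grows with sqrt(n)

print(f"next-token prediction : ~{next_token_params:,.0f} parameters")
print(f"sequence-to-sequence  : ~{seq_to_seq_params:,.0f} parameters")
print(f"ratio (seq2seq / next-token) = sqrt(n) = {seq_to_seq_params / next_token_params:.1f}")
```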
Related papers
- Attention Learning is Needed to Efficiently Learn Parity Function [6.944372188747803]
We analyze transformers on the $k$-parity problem.
We prove that transformers require only $O(k)$ parameters, below the theoretical lower bound on the parameters needed by FFNNs.
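For reference, the $k$-parity target itself is simple to state; the sketch below assumes the standard formulation (the label is the parity of a fixed hidden subset of $k$ out of $d$ input bits), which may differ in details from the paper's exact setup.
```python
import random

# The k-parity target in its standard form (assumed here; the paper's exact
# setup may differ): the label of a d-bit input is the parity (XOR) of a
# fixed, hidden subset S of k coordinates.
d, k = 16, 4
S = random.sample(range(d), k)            # hidden relevant coordinates

def k_parity(x):
    """Label = sum of the k relevant bits, mod 2."""
    return sum(x[i] for i in S) % 2

x = [random.randint(0, 1) for _ in range(d)]
print(S, x, k_parity(x))
```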
arXiv Detail & Related papers (2025-02-11T13:41:30Z) - Exact Sequence Classification with Hardmax Transformers [0.0]
We prove that hardmax attention transformers perfectly classify datasets of $N$ labeled sequences in $\mathbb{R}^d$, $d \geq 2$.
Specifically, given $N$ sequences of arbitrary but finite length in $\mathbb{R}^d$, we construct a transformer with $\mathcal{O}(N)$ blocks and $\mathcal{O}(Nd)$ parameters that perfectly classifies this dataset.
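A minimal sketch of what a hardmax attention head computes is given below, assuming the common formulation in which each query attends only to its highest-scoring key (ties ignored); the weights and dimensions are placeholders, not the paper's construction.
```python
import numpy as np

# Minimal sketch of a single hardmax attention head (assumed form: each
# query attends only to the key with the largest score), as opposed to the
# usual softmax weighting. Dimensions and weights are illustrative only.
def hardmax_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T                      # (n, n) attention scores
    winners = scores.argmax(axis=1)       # hardmax: one key per query
    return V[winners]                     # each token copies one value vector

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))
out = hardmax_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (5, 3)
```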
arXiv Detail & Related papers (2025-02-04T12:31:00Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - Breaking Symmetry When Training Transformers [3.434553688053531]
We show that the prediction for output token $n+1$ of Transformer architectures lacking either positional encodings or causal attention is invariant to permutations of input tokens $1, 2, \ldots, n-1$.
We elaborate on the argument that the causal connection mechanism is what enables Transformers to model input sequences in which order matters.
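The claim is easy to check numerically in the simplest case; the sketch below uses a single self-attention layer with no positional encodings and no causal mask (one of the two settings covered by the statement) and random weights.
```python
import numpy as np

# Tiny numerical check of the permutation claim for a single self-attention
# layer with no positional encodings and no causal mask. Weights and inputs
# are random placeholders.
rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def last_token_output(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T)
    return (A @ V)[-1]                    # output at the final position

perm = np.r_[rng.permutation(n - 1), n - 1]   # shuffle earlier tokens, keep last
print(np.allclose(last_token_output(X), last_token_output(X[perm])))  # True
```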
arXiv Detail & Related papers (2024-02-06T00:32:28Z) - Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z) - What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results with vanilla Transformer.
Experiments on tasks requiring various sequence lengths validate our graph function.
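The hypercube interaction pattern can be sketched as follows, assuming the natural formulation in which token $i$ attends to token $j$ exactly when their binary indices differ in one bit; the paper's actual graph construction may include additional details.
```python
# Sketch of a hypercube interaction pattern (assumed form: token i attends to
# token j iff their binary indices differ in exactly one bit), for a sequence
# length that is a power of two.
def hypercube_neighbors(i, n):
    """Indices reachable from token i by flipping one bit of its index."""
    b, out = 0, []
    while (1 << b) < n:
        j = i ^ (1 << b)
        if j < n:
            out.append(j)
        b += 1
    return out

n = 8
for i in range(n):
    print(i, hypercube_neighbors(i, n))   # each token has log2(n) neighbors
```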
arXiv Detail & Related papers (2022-05-27T14:36:55Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
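The FFT step rests on a standard trick: a Toeplitz matrix-vector product can be computed in $O(n \log n)$ by embedding the Toeplitz matrix in a circulant one. The sketch below illustrates that trick in isolation with random stand-in scores; it is not the paper's full kernelized-attention algorithm.
```python
import numpy as np

# Sketch of the core trick: multiplying by a Toeplitz matrix in O(n log n)
# by embedding it in a circulant matrix and using the FFT. The scores are
# random stand-ins for relative-position values t_{-(n-1)}..t_{n-1}.
rng = np.random.default_rng(0)
n = 8
r = rng.normal(size=2 * n - 1)            # r[n-1+k] plays the role of t_k
T = np.array([[r[n - 1 + i - j] for j in range(n)] for i in range(n)])
v = rng.normal(size=n)

# Circulant embedding: first column of a length-2n circulant containing T.
c = np.concatenate([r[n - 1:], [0.0], r[:n - 1]])
fast = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.r_[v, np.zeros(n)])).real[:n]

print(np.allclose(T @ v, fast))           # True
```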
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
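For intuition, the sketch below builds one generic $O(n)$-connection pattern (sliding window plus a few global tokens); the exact sparsity patterns analyzed in the paper may differ.
```python
import numpy as np

# Sketch of one common O(n)-connection sparsity pattern (sliding window plus
# a handful of global tokens); a generic example, not necessarily the exact
# patterns studied in the paper.
def sparse_mask(n, window=2, n_global=1):
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True             # local sliding window
    mask[:, :n_global] = True             # everyone attends to global tokens
    mask[:n_global, :] = True             # global tokens attend to everyone
    return mask

m = sparse_mask(16)
print(m.sum(), "connections for n=16")    # grows linearly in n, not as n^2
```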
arXiv Detail & Related papers (2020-06-08T18:30:12Z) - Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
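The compression step can be pictured as pooling the hidden-state sequence to a shorter one; the sketch below uses stride-2 mean pooling as one plausible choice, not necessarily the paper's exact pooling operator.
```python
import numpy as np

# Sketch of the compression step described above: pooling the sequence of
# hidden states to half its length (stride-2 mean pooling, one plausible
# choice; the paper's pooling operator may differ).
def pool_hidden_states(H, stride=2):
    n, d = H.shape
    n_out = n // stride
    return H[: n_out * stride].reshape(n_out, stride, d).mean(axis=1)

H = np.random.default_rng(0).normal(size=(8, 4))   # (seq_len, hidden_dim)
print(pool_hidden_states(H).shape)                 # (4, 4): shorter sequence
```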
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.