Theoretical limitations of multi-layer Transformer
- URL: http://arxiv.org/abs/2412.02975v1
- Date: Wed, 04 Dec 2024 02:37:31 GMT
- Title: Theoretical limitations of multi-layer Transformer
- Authors: Lijie Chen, Binghui Peng, Hongxun Wu
- Abstract summary: We prove the first $\textit{unconditional}$ lower bound against multi-layer decoder-only transformers.
We also introduce a new proof technique that iteratively finds a certain $\textit{indistinguishable}$ $\textit{decomposition}$ of all possible inputs.
We believe our new communication model and proof technique will be helpful to further understand the computational power of transformers.
- Score: 14.63344366356708
- License:
- Abstract: Transformers, especially the decoder-only variants, are the backbone of most modern large language models; yet we do not have much understanding of their expressive power except for the simple $1$-layer case. Due to the difficulty of analyzing multi-layer models, all previous work relies on unproven complexity conjectures to show limitations for multi-layer Transformers. In this work, we prove the first $\textit{unconditional}$ lower bound against multi-layer decoder-only transformers. For any constant $L$, we prove that any $L$-layer decoder-only transformer needs a polynomial model dimension ($n^{\Omega(1)}$) to perform sequential composition of $L$ functions over an input of $n$ tokens. As a consequence, our results give: (1) the first depth-width trade-off for multi-layer transformers, exhibiting that the $L$-step composition task is exponentially harder for $L$-layer models compared to $(L+1)$-layer ones; (2) an unconditional separation between encoder and decoder, exhibiting a hard task for decoders that can be solved by an exponentially shallower and smaller encoder; (3) a provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought. On the technical side, we propose the multi-party $\textit{autoregressive}$ $\textit{communication}$ $\textit{model}$ that captures the computation of a decoder-only Transformer. We also introduce a new proof technique that finds a certain $\textit{indistinguishable}$ $\textit{decomposition}$ of all possible inputs iteratively for proving lower bounds in this model. We believe our new communication model and proof technique will be helpful to further understand the computational power of transformers.
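To make the composition task concrete, here is a minimal Python sketch of one plausible instance of $L$-step sequential function composition; the encoding (functions as lookup tables over a token alphabet) is an illustrative assumption, not the paper's exact formulation. Per the abstract, any $L$-layer decoder-only transformer needs model dimension $n^{\Omega(1)}$ on such a task, while the depth-width trade-off says the same task is exponentially easier for $(L+1)$-layer models.

```python
import random

def make_composition_instance(n_tokens: int, n_steps: int, seed: int = 0):
    """Build a toy instance of L-step sequential function composition.

    Each step l is a random function f_l over the token alphabet
    {0, ..., n_tokens - 1}, encoded as a lookup table. The task is to
    return f_L(...f_2(f_1(x))...) for a given start token x.
    (Illustrative encoding only; the paper's exact task may differ.)
    """
    rng = random.Random(seed)
    functions = [
        [rng.randrange(n_tokens) for _ in range(n_tokens)]  # table for f_l
        for _ in range(n_steps)
    ]
    start = rng.randrange(n_tokens)
    return functions, start

def compose(functions, start):
    """Apply the functions sequentially: f_L(...f_1(start)...)."""
    x = start
    for table in functions:
        x = table[x]
    return x

if __name__ == "__main__":
    funcs, x0 = make_composition_instance(n_tokens=8, n_steps=3)
    print("answer:", compose(funcs, x0))
```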
Related papers
- Provably Overwhelming Transformer Models with Designed Inputs [0.0]
We say that $\mathcal{M}$ is "overwhelmed" by $s$ when the output of the model evaluated on this string plus any additional string $t$, $\mathcal{M}(s + t)$, is completely insensitive to the value of the string $t$ whenever length($t$) $\leq n_{free}$.
We prove a particularly strong worst-case form of "over-squashing", which we use to bound the model's behavior.
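One way to read this definition operationally: a prompt $s$ overwhelms the model if no suffix $t$ with length($t$) $\leq n_{free}$ can change the output on $s + t$. The brute-force check below is only a sketch under that reading; the model function, the alphabet, and the toy majority-vote model are placeholder assumptions, not the paper's construction.

```python
from itertools import product

def is_overwhelmed(model_fn, s, alphabet, n_free):
    """Check whether the prompt s 'overwhelms' model_fn: the output on
    s + t is the same for every suffix t with len(t) <= n_free.
    (Brute-force illustration; feasible only for tiny alphabets/lengths.)
    """
    baseline = model_fn(s)
    for length in range(1, n_free + 1):
        for t in product(alphabet, repeat=length):
            if model_fn(s + "".join(t)) != baseline:
                return False
    return True

if __name__ == "__main__":
    # Placeholder "model": outputs the majority character of the prompt,
    # so a long run of 'a' is insensitive to short suffixes.
    toy_model = lambda prompt: max(set(prompt), key=prompt.count)
    print(is_overwhelmed(toy_model, "a" * 50, alphabet="ab", n_free=3))
```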
arXiv Detail & Related papers (2025-02-09T21:21:57Z) - Circuit Complexity Bounds for RoPE-based Transformer Architecture [25.2590541420499]
Empirical evidence suggests that $\mathsf{RoPE}$-based Transformer architectures demonstrate greater generalization capabilities than conventional Transformer models.
We show that unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic formula evaluation problem.
arXiv Detail & Related papers (2024-11-12T07:24:41Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
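For reference, add-$\lambda$ smoothing is the classical estimator being compared against: every continuation gets a pseudo-count of $\lambda$ before normalizing. A minimal bigram version with a made-up corpus and an arbitrary $\lambda$:

```python
from collections import Counter, defaultdict

def add_lambda_bigram(corpus, vocab, lam=0.1):
    """Estimate P(w | prev) with add-lambda smoothing:
    P(w | prev) = (count(prev, w) + lam) / (count(prev) + lam * |V|).
    """
    pair_counts = defaultdict(Counter)
    for prev, w in zip(corpus, corpus[1:]):
        pair_counts[prev][w] += 1
    V = len(vocab)

    def prob(prev, w):
        total = sum(pair_counts[prev].values())
        return (pair_counts[prev][w] + lam) / (total + lam * V)

    return prob

if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ran".split()
    vocab = sorted(set(corpus))
    p = add_lambda_bigram(corpus, vocab, lam=0.5)
    print(round(p("the", "cat"), 3), round(p("the", "ran"), 3))
```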
arXiv Detail & Related papers (2024-10-03T21:21:02Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
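Read concretely, the claimed two-stage mechanism is: estimate the transition matrix $W$ from consecutive context pairs, then apply the estimated map to the last token. The least-squares sketch below is one interpretation of that recipe, not the authors' trained Transformer or their exact setting.

```python
import numpy as np

def predict_next(context):
    """Two-stage in-context prediction for sequences x_{t+1} = W x_t:
    (1) estimate W by least squares from consecutive context pairs,
    (2) apply the estimated map to the last element of the context.
    """
    X = np.stack(context[:-1], axis=1)   # d x (T-1) inputs
    Y = np.stack(context[1:], axis=1)    # d x (T-1) targets
    W_hat = Y @ np.linalg.pinv(X)        # least-squares estimate of W
    return W_hat @ context[-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, T = 3, 8
    W = rng.standard_normal((d, d))
    W /= 1.1 * max(abs(np.linalg.eigvals(W)))   # keep the dynamics stable
    x = [rng.standard_normal(d)]
    for _ in range(T - 1):
        x.append(W @ x[-1])
    err = np.linalg.norm(predict_next(x) - W @ x[-1])
    print("prediction error:", err)             # should be near zero
```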
arXiv Detail & Related papers (2024-02-08T16:24:44Z) - Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
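The $O(\log T)$-depth claim leans on the fact that automaton transition functions compose associatively, so a length-$T$ input can be collapsed by a balanced tree of compositions of depth $O(\log T)$. The sketch below illustrates that shortcut directly on transition tables; it is not the paper's Transformer construction.

```python
def compose(f, g):
    """Compose two transition functions on states: apply f, then g."""
    return {q: g[f[q]] for q in f}

def run_by_tree(delta_per_symbol, input_seq, start_state):
    """Simulate a DFA on input_seq by combining per-symbol transition
    functions in a balanced tree of depth O(log T), instead of the
    O(T) sequential scan. (Illustrates the 'shortcut' idea only.)
    """
    funcs = [delta_per_symbol[a] for a in input_seq]
    while len(funcs) > 1:
        paired = []
        for i in range(0, len(funcs) - 1, 2):
            paired.append(compose(funcs[i], funcs[i + 1]))
        if len(funcs) % 2 == 1:
            paired.append(funcs[-1])
        funcs = paired
    return funcs[0][start_state]

if __name__ == "__main__":
    # Parity automaton over {0, 1}: the state flips on a 1.
    delta = {"0": {0: 0, 1: 1}, "1": {0: 1, 1: 0}}
    word = "1101001"
    print(run_by_tree(delta, word, start_state=0))  # even number of ones -> 0
```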
arXiv Detail & Related papers (2022-10-19T17:45:48Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
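One standard way to picture the $O(n)$-connection regime is a sparsity pattern where each position attends to a small local window plus a constant number of global tokens, giving linearly many attended pairs. The mask below is an illustrative pattern of that kind, not necessarily the specific sparse patterns studied in the paper.

```python
import numpy as np

def sparse_attention_mask(n, window=2, n_global=1):
    """Boolean n x n mask where position i may attend to position j.

    Each position sees a local window of +/- `window` neighbours plus the
    first `n_global` positions as global tokens, so the number of True
    entries is O(n) for fixed window and n_global.
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True          # local window
        mask[i, :n_global] = True      # everyone sees the global tokens
        mask[:n_global, i] = True      # global tokens see everyone
    return mask

if __name__ == "__main__":
    m = sparse_attention_mask(n=16)
    print("connections:", int(m.sum()), "of", 16 * 16)  # linear, not quadratic
```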
arXiv Detail & Related papers (2020-06-08T18:30:12Z) - Reformer: The Efficient Transformer [21.425616422007543]
We introduce two techniques to improve the efficiency of Transformers.
We replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence.
The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
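The complexity change comes from bucketing: queries/keys are hashed with an angular locality-sensitive hash (random projections followed by an argmax), positions are sorted by bucket, and attention is computed only within buckets, which is roughly $O(L \log L)$ after sorting. The sketch below shows a single-round bucketing step without Reformer's chunking, multi-round hashing, or causal masking, so it is an illustration of the idea rather than the full algorithm.

```python
import numpy as np

def lsh_buckets(vectors, n_buckets, seed=0):
    """Assign each vector to a bucket via angular LSH: project onto random
    directions and take the argmax over [R x ; -R x], so nearby vectors
    tend to land in the same bucket.
    """
    rng = np.random.default_rng(seed)
    d = vectors.shape[1]
    R = rng.standard_normal((d, n_buckets // 2))
    proj = vectors @ R
    return np.argmax(np.concatenate([proj, -proj], axis=1), axis=1)

def bucketed_attention_pairs(queries, n_buckets=8):
    """Return the (i, j) pairs that would be compared when attention is
    restricted to same-bucket positions -- typically far fewer than L^2.
    """
    buckets = lsh_buckets(queries, n_buckets)
    order = np.argsort(buckets, kind="stable")   # sort by bucket: O(L log L)
    pairs = []
    for b in np.unique(buckets):
        idx = order[buckets[order] == b]
        pairs.extend((int(i), int(j)) for i in idx for j in idx)
    return pairs

if __name__ == "__main__":
    L, d = 64, 16
    q = np.random.default_rng(1).standard_normal((L, d))
    print("compared pairs:", len(bucketed_attention_pairs(q)), "vs full:", L * L)
```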
arXiv Detail & Related papers (2020-01-13T18:38:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.