Transformers Learn Shortcuts to Automata
- URL: http://arxiv.org/abs/2210.10749v2
- Date: Tue, 2 May 2023 14:16:15 GMT
- Title: Transformers Learn Shortcuts to Automata
- Authors: Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang
- Abstract summary: We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
- Score: 52.015990420075944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Algorithmic reasoning requires capabilities which are most naturally
understood through recurrent models of computation, like the Turing machine.
However, Transformer models, while lacking recurrence, are able to perform such
reasoning using far fewer layers than the number of reasoning steps. This
raises the question: what solutions are learned by these shallow and
non-recurrent models? We find that a low-depth Transformer can represent the
computations of any finite-state automaton (thus, any bounded-memory
algorithm), by hierarchically reparameterizing its recurrent dynamics. Our
theoretical results characterize shortcut solutions, whereby a Transformer with
$o(T)$ layers can exactly replicate the computation of an automaton on an input
sequence of length $T$. We find that polynomial-sized $O(\log T)$-depth
solutions always exist; furthermore, $O(1)$-depth simulators are surprisingly
common, and can be understood using tools from Krohn-Rhodes theory and circuit
complexity. Empirically, we perform synthetic experiments by training
Transformers to simulate a wide variety of automata, and show that shortcut
solutions can be learned via standard training. We further investigate the
brittleness of these solutions and propose potential mitigations.
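The intuition behind the $O(\log T)$-depth result can be illustrated outside the Transformer setting: each input symbol induces a transition map on the automaton's state set, and because map composition is associative, all $T$ prefix states can be computed by a parallel prefix (scan) of composition depth $O(\log T)$ rather than by a $T$-step recurrence. The following plain-Python sketch shows this associativity trick on a hypothetical two-state parity automaton; it is an illustration of the idea, not the paper's attention-based construction.

```python
# Minimal sketch (assumed example, not the paper's construction): simulate an
# automaton by composing per-symbol transition maps. Because `compose` is
# associative, the prefix states could equally be computed by a balanced tree
# of compositions of depth O(log T) -- the "shortcut" the paper studies.

def compose(f, g):
    """Compose transition maps: apply f first, then g (maps are tuples over states)."""
    return tuple(g[s] for s in f)

def prefix_states(maps, num_states, start_state=0):
    """State reached after each prefix of the input, via a scan over transition maps.

    The loop is written sequentially for clarity; associativity of `compose`
    is what allows an O(log T)-depth parallel evaluation instead.
    """
    acc = tuple(range(num_states))  # identity map on states
    out = []
    for m in maps:
        acc = compose(acc, m)
        out.append(acc[start_state])
    return out

# Hypothetical parity automaton: states {0, 1}; input 0 keeps the state, input 1 flips it.
TRANSITIONS = {0: (0, 1), 1: (1, 0)}
word = [1, 0, 1, 1]
maps = [TRANSITIONS[x] for x in word]
print(prefix_states(maps, num_states=2))  # [1, 1, 0, 1]
```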
Related papers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of the training flow on the regression loss despite the non-convexity of the loss landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
arXiv Detail & Related papers (2024-10-03T21:21:02Z)
- Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
We show that a novel fast approximation method can calculate the gradients in almost linear time.
By improving the efficiency of gradient computation, we hope that this work will facilitate more effective training and deployment of long-context language models.
arXiv Detail & Related papers (2024-08-23T17:16:43Z)
- Simulating Weighted Automata over Sequences and Trees with Transformers [5.078561931628571]
We show that transformers can simulate weighted finite automata (WFAs), a class of models which subsumes DFAs, as well as weighted tree automata (WTAs); a minimal sketch of WFA evaluation appears after this list.
We prove these claims formally and provide upper bounds on the sizes of the transformer models needed as a function of the number of states of the target automata.
arXiv Detail & Related papers (2024-03-12T21:54:34Z)
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z)
- The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits.
This provides insight into the power of transformers using known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z)
- Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
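As referenced in the weighted-automata entry above, here is a minimal sketch of how a WFA scores a string: an initial weight vector, one transition matrix per input symbol, and a final weight vector, multiplied along the input. The states and weights below are hypothetical and not taken from the cited paper; a DFA corresponds to the special case of 0/1 weights.

```python
import numpy as np

# Minimal WFA evaluation sketch (hypothetical weights, for illustration only).
# A WFA with k states scores a string x_1..x_T as
#   alpha^T * A[x_1] * A[x_2] * ... * A[x_T] * beta,
# where alpha and beta are initial/final weight vectors and A[s] is the
# k x k transition matrix of symbol s.

alpha = np.array([1.0, 0.0])          # start weight concentrated on state 0
beta = np.array([0.0, 1.0])           # final weight on state 1
A = {"a": np.array([[0.5, 0.5],
                    [0.0, 1.0]]),
     "b": np.array([[1.0, 0.0],
                    [0.2, 0.8]])}

def wfa_score(word):
    """Weight the WFA assigns to `word`, via left-to-right matrix-vector products."""
    v = alpha
    for symbol in word:
        v = v @ A[symbol]
    return float(v @ beta)

print(wfa_score("ab"))
```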