Transformers Learn Shortcuts to Automata
- URL: http://arxiv.org/abs/2210.10749v2
- Date: Tue, 2 May 2023 14:16:15 GMT
- Title: Transformers Learn Shortcuts to Automata
- Authors: Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril
Zhang
- Abstract summary: We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
- Score: 52.015990420075944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Algorithmic reasoning requires capabilities which are most naturally
understood through recurrent models of computation, like the Turing machine.
However, Transformer models, while lacking recurrence, are able to perform such
reasoning using far fewer layers than the number of reasoning steps. This
raises the question: what solutions are learned by these shallow and
non-recurrent models? We find that a low-depth Transformer can represent the
computations of any finite-state automaton (thus, any bounded-memory
algorithm), by hierarchically reparameterizing its recurrent dynamics. Our
theoretical results characterize shortcut solutions, whereby a Transformer with
$o(T)$ layers can exactly replicate the computation of an automaton on an input
sequence of length $T$. We find that polynomial-sized $O(\log T)$-depth
solutions always exist; furthermore, $O(1)$-depth simulators are surprisingly
common, and can be understood using tools from Krohn-Rhodes theory and circuit
complexity. Empirically, we perform synthetic experiments by training
Transformers to simulate a wide variety of automata, and show that shortcut
solutions can be learned via standard training. We further investigate the
brittleness of these solutions and propose potential mitigations.
Related papers
- Simulating Weighted Automata over Sequences and Trees with Transformers [5.078561931628571]
We show that transformers can simulate weighted finite automata (WFAs), a class of models which subsumes DFAs, as well as weighted tree automata (WTA)
We prove these claims formally and provide upper bounds on the sizes of the transformer models needed as a function of the number of states the target automata.
arXiv Detail & Related papers (2024-03-12T21:54:34Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits.
This provides insight on the power of transformers using known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z) - Mixability made efficient: Fast online multiclass logistic regression [68.8204255655161]
We show that mixability can be a powerful tool to obtain algorithms with optimal regret.
The resulting methods often suffer from high computational complexity which has reduced their practical applicability.
arXiv Detail & Related papers (2021-10-08T08:22:05Z) - Statistically Meaningful Approximation: a Case Study on Approximating
Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z) - Sub-Linear Memory: How to Make Performers SLiM [38.068090269482425]
vanilla Transformers require $O(L2)$ in serial time and memory as functions of input length $L$.
Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation.
We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory.
arXiv Detail & Related papers (2020-12-21T13:56:04Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.