Related papers: Transformers as Transducers

Transformers as Transducers

URL: http://arxiv.org/abs/2404.02040v1
Date: Tue, 2 Apr 2024 15:34:47 GMT
Title: Transformers as Transducers
Authors: Lena Strobl, Dana Angluin, David Chiang, Jonathan Rawski, Ashish Sabharwal,
Abstract summary: We study the sequence-to-sequence mapping capacity of transformers. We find that they can express surprisingly large classes of transductions. A new proof that transformer decoders are Turing-complete.
Score: 27.48483887144685
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the sequence-to-sequence mapping capacity of transformers by relating them to finite transducers, and find that they can express surprisingly large classes of transductions. We do so using variants of RASP, a programming language designed to help people "think like transformers," as an intermediate representation. We extend the existing Boolean variant B-RASP to sequence-to-sequence functions and show that it computes exactly the first-order rational functions (such as string rotation). Then, we introduce two new extensions. B-RASP[pos] enables calculations on positions (such as copying the first half of a string) and contains all first-order regular functions. S-RASP adds prefix sum, which enables additional arithmetic operations (such as squaring a string) and contains all first-order polyregular functions. Finally, we show that masked average-hard attention transformers can simulate S-RASP. A corollary of our results is a new proof that transformer decoders are Turing-complete.

Related papers

Transformers are Efficient Compilers, Provably [11.459397066286822]
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks. In this paper, we take the first steps towards a formal investigation of using transformers as compilers from an expressive power perspective. We introduce a representative programming language, Mini-Husky, which encapsulates key features of modern C-like languages.
arXiv Detail & Related papers (2024-10-07T20:31:13Z)
Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly transformers in which only the embedding layers are optimized. We find that these random transformers can perform a wide range of meaningful algorithmic tasks. Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers [8.908747084128397]
We introduce the temporal counting logic $textsfK_textt$[#] alongside the RASP variant $textsfC-RASP$. We show they are equivalent to each other, and that together they are the best-known lower bound on the formal expressivity of future-masked soft attention transformers with unbounded input size.
arXiv Detail & Related papers (2024-04-05T20:36:30Z)
Prompting a Pretrained Transformer Can Be a Universal Approximator [105.59562522323274]
We show that much smaller pretrained models than previously thought can be universal approximators when prefixed. We also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.
arXiv Detail & Related papers (2024-02-22T18:12:48Z)
Investigating Recurrent Transformers with Dynamic Halt [64.862738244735]
We study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism. We propose and investigate novel ways to extend and combine the methods.
arXiv Detail & Related papers (2024-02-01T19:47:31Z)
Can Transformers Learn Sequential Function Classes In Context? [0.0]
In-context learning (ICL) has revolutionized the capabilities of transformer models in NLP. We introduce a novel sliding window sequential function class and employ toy-sized transformers with a GPT-2 architecture to conduct our experiments. Our analysis indicates that these models can indeed leverage ICL when trained on non-textual sequential function classes.
arXiv Detail & Related papers (2023-12-19T22:57:13Z)
Sumformer: Universal Approximation for Efficient Transformers [2.4832703558223725]
We introduce Sumformer, a novel and simple architecture capable of universally approxingimating sequence-to-sequence functions. We derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
arXiv Detail & Related papers (2023-07-05T13:59:35Z)
Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
Glushkov's construction for functional subsequential transducers [91.3755431537592]
Glushkov's construction has many interesting properties and they become even more evident when applied to transducers. Special flavour of regular expressions is introduced, which can be efficiently converted to $epsilon$-free functional subsequential weighted finite state transducers.
arXiv Detail & Related papers (2020-08-05T17:09:58Z)
Multi-level Head-wise Match and Aggregation in Transformer for Textual Sequence Matching [87.97265483696613]
We propose a new approach to sequence pair matching with Transformer, by learning head-wise matching representations on multiple levels. Experiments show that our proposed approach can achieve new state-of-the-art performance on multiple tasks.
arXiv Detail & Related papers (2020-01-20T20:02:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.