Softmax Transformers are Turing-Complete
- URL: http://arxiv.org/abs/2511.20038v1
- Date: Tue, 25 Nov 2025 08:08:39 GMT
- Title: Softmax Transformers are Turing-Complete
- Authors: Hongjian Jiang, Michael Hahn, Georg Zetzsche, Anthony Widjaja Lin
- Abstract summary: We prove that length-generalizable softmax CoT transformers are Turing-complete. We show that CoT C-RASP with causal masking alone is not Turing-complete for arbitrary languages, but its extension with relative positional encoding is. We empirically validate our theory by training transformers on languages requiring complex arithmetic reasoning.
- Score: 4.231989115090749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hard attention Chain-of-Thought (CoT) transformers are known to be Turing-complete. However, it is an open problem whether softmax attention CoT transformers are Turing-complete. In this paper, we prove a stronger result: length-generalizable softmax CoT transformers are Turing-complete. More precisely, our Turing-completeness proof goes via the CoT extension of the Counting RASP (C-RASP), which corresponds to softmax CoT transformers that admit length generalization. We prove Turing-completeness for CoT C-RASP with causal masking over a unary alphabet (more generally, for letter-bounded languages). While we show this is not Turing-complete for arbitrary languages, we prove that its extension with relative positional encoding is Turing-complete for arbitrary languages. We empirically validate our theory by training transformers for languages requiring complex (non-linear) arithmetic reasoning.
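The listing does not spell out the C-RASP machinery, so the following is only a minimal sketch of the prefix-counting primitive that Counting-RASP-style programs are built from, under the usual reading that each position may only aggregate over its causal prefix; the function names and the example language are hypothetical, not the paper's construction.

```python
# Minimal, illustrative sketch of a C-RASP-style counting primitive (names are
# hypothetical; this is not the paper's construction).

def prefix_count(tokens, predicate):
    """For each position i, count the positions j <= i where predicate holds,
    i.e. the causally masked analogue of attending over the prefix."""
    counts, total = [], 0
    for tok in tokens:
        total += int(predicate(tok))
        counts.append(total)
    return counts

def balanced_ab(tokens):
    """Accept iff every prefix has at least as many 'a's as 'b's and the totals
    match at the end -- a simple counting language expressible with such
    prefix counts and comparisons."""
    count_a = prefix_count(tokens, lambda t: t == "a")
    count_b = prefix_count(tokens, lambda t: t == "b")
    prefix_ok = all(a >= b for a, b in zip(count_a, count_b))
    return prefix_ok and count_a[-1] == count_b[-1]

print(balanced_ab(list("aabb")))  # True
print(balanced_ab(list("abab")))  # True
print(balanced_ab(list("abba")))  # False
```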
Related papers
- Context-Free Recognition with Transformers [57.46376097734401]
We show that looped transformers with $\mathcal{O}(\log n)$ looping layers and $\mathcal{O}(n^6)$ padding tokens can recognize all CFLs. We empirically validate our results and show that looping helps on a language that provably requires logarithmic depth.
arXiv Detail & Related papers (2026-01-05T03:14:23Z) - Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization [53.89723291716722]
A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. We provide the first optimization guarantee that constant-depth transformers provably learn $\mathsf{NC}^1$-complete problems with CoT.
arXiv Detail & Related papers (2025-11-10T18:40:24Z) - Constant Bit-size Transformers Are Turing Complete [6.70318953627082]
We prove that any Turing machine running on inputs of arbitrary length can be simulated by a constant bit-size transformer. Our approach relies on simulating Post machines, a Turing-complete computational model.
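Neither the exact Post machine variant nor the simulation details are given in this listing; the sketch below shows only the generic flavor of a Post-style queue machine (a finite control driving a single FIFO queue), with a made-up toy transition table, not the paper's construction.

```python
from collections import deque

def run_post_machine(delta, start_state, accept_states, word, max_steps=10_000):
    """Post-style queue machine: pop the front symbol, then, based on
    (state, symbol), append a string to the back and switch state."""
    state, queue = start_state, deque(word + "#")  # '#' marks the end of input
    for _ in range(max_steps):
        if state in accept_states:
            return True
        if not queue:
            return False
        symbol = queue.popleft()
        if (state, symbol) not in delta:
            return False
        state, appended = delta[(state, symbol)]
        queue.extend(appended)
    return False  # no decision within the step budget

# Toy transition table (purely illustrative): consume 'a's until the end
# marker is reached, then accept.
delta = {
    ("scan", "a"): ("scan", ""),
    ("scan", "#"): ("accept", ""),
}
print(run_post_machine(delta, "scan", {"accept"}, "aaa"))  # True
```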
arXiv Detail & Related papers (2025-05-22T02:45:38Z) - Ask, and it shall be given: On the Turing completeness of prompting [47.08833920586575]
Large language models (LLMs) have revolutionized machine learning and initiated the so-called LLM prompting paradigm. Here, we present what is, to the best of our knowledge, the first theoretical study of the LLM prompting paradigm. We show that prompting is Turing-complete: there exists a finite-size Transformer such that, for any computable function, there exists a corresponding prompt following which the Transformer computes the function.
arXiv Detail & Related papers (2024-11-04T11:26:38Z) - Universal Length Generalization with Turing Programs [24.077722969687898]
We propose Turing Programs, a strategy that decomposes an algorithmic task into steps mimicking a Turing Machine.
By using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks.
We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task.
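The precise Turing Program format is not reproduced here; the sketch below only illustrates the general idea of decomposing a task into Turing-machine-like steps, with each chain-of-thought line re-emitting the full tape and marking the head position. The task (unary increment) and the snapshot format are assumptions for illustration.

```python
def unary_increment_cot(tape):
    """Emit one tape snapshot per simulated Turing-machine step while appending
    a single '1' to a unary string; the joined snapshots play the role of the
    chain of thought."""
    cells, head, steps = list(tape) + ["_"], 0, []

    def snapshot():
        steps.append("".join("[" + c + "]" if i == head else c
                             for i, c in enumerate(cells)))

    snapshot()
    while cells[head] == "1":   # scan right to the first blank cell
        head += 1
        snapshot()
    cells[head] = "1"           # write a '1' on the blank and halt
    snapshot()
    return "\n".join(steps), "".join(cells)

cot, result = unary_increment_cot("111")
print(cot)      # prints the five tape snapshots, one per line
print(result)   # 1111
```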
arXiv Detail & Related papers (2024-07-03T17:53:44Z) - On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning [87.73401758641089]
Chain-of-thought (CoT) reasoning has improved the performance of modern language models (LMs). We show that LMs can represent the same family of distributions over strings as probabilistic Turing machines.
arXiv Detail & Related papers (2024-06-20T10:59:02Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - The Expressive Power of Transformers with Chain of Thought [29.839710738657203]
In practice, transformers can be improved by allowing them to use a "chain of thought" or "scratchpad" before producing an answer.
We show that this intermediate generation does increase their computational power, but the amount of increase depends crucially on the amount of intermediate generation.
Our results also imply that linear steps keep transformer decoders within context-sensitive languages.
arXiv Detail & Related papers (2023-10-11T22:35:18Z) - How Powerful are Decoder-Only Transformer Neural Models? [0.0]
This is the first work to address the Turing completeness of the underlying technology employed in GPT-x.
We show that the sparsity/compressibility of the word embedding is an important consideration for Turing completeness to hold.
arXiv Detail & Related papers (2023-05-26T15:35:43Z) - Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z) - On the Computational Power of Transformers and its Implications in Sequence Modeling [10.497742214344855]
In particular, the roles of various components in Transformers, such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear.
We provide an alternate and simpler proof to show that vanilla Transformers are Turing-complete.
We further analyze the necessity of each component for the Turing-completeness of the network; interestingly, we find that a particular type of residual connection is necessary.
arXiv Detail & Related papers (2020-06-16T16:27:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.