The Counting Power of Transformers
- URL: http://arxiv.org/abs/2505.11199v2
- Date: Fri, 26 Sep 2025 22:10:51 GMT
- Title: The Counting Power of Transformers
- Authors: Marco Sälzer, Chris Köcher, Alexander Kozachinskiy, Georg Zetzsche, Anthony Widjaja Lin
- Abstract summary: We provide a formal framework for investigating the counting power of transformers. Our main result is that transformers can express counting properties that are highly nonlinear. More precisely, we prove that transformers can capture all semialgebraic counting properties.
- Score: 45.96383652484399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Counting properties (e.g. determining whether certain tokens occur more than other tokens in a given input text) have played a significant role in the study of expressiveness of transformers. In this paper, we provide a formal framework for investigating the counting power of transformers. We argue that all existing results demonstrate transformers' expressivity only for (semi-)linear counting properties, i.e., which are expressible as a boolean combination of linear inequalities. Our main result is that transformers can express counting properties that are highly nonlinear. More precisely, we prove that transformers can capture all semialgebraic counting properties, i.e., expressible as a boolean combination of arbitrary multivariate polynomials (of any degree). Among others, these generalize the counting properties that can be captured by C-RASP softmax transformers, which capture only linear counting properties. To complement this result, we exhibit a natural subclass of (softmax) transformers that completely characterizes semialgebraic counting properties. Through connections with the Hilbert's tenth problem, this expressivity of transformers also yields a new undecidability result for analyzing an extremely simple transformer model -- surprisingly with neither positional encodings (i.e. NoPE-transformers) nor masking. We also experimentally validate trainability of such counting properties.
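To make the notion concrete, here is a minimal sketch (an illustrative example, not taken from the paper) of a nonlinear semialgebraic counting property: a predicate on the token counts of an input defined by a degree-2 polynomial inequality. The particular tokens and inequality are hypothetical choices for illustration.

```python
from collections import Counter

def semialgebraic_property(text: str) -> bool:
    """Illustrative semialgebraic counting property (hypothetical example):
    holds iff  #a * #b > #c**2,  a degree-2 polynomial inequality over
    the token counts of the input -- nonlinear, hence not expressible as
    a boolean combination of linear inequalities over the counts."""
    counts = Counter(text)
    return counts["a"] * counts["b"] > counts["c"] ** 2

# "aabb":  2 * 2 > 0**2  -> holds
# "abccc": 1 * 1 > 3**2  -> does not hold
print(semialgebraic_property("aabb"))
print(semialgebraic_property("abccc"))
```

A linear counting property, by contrast, would only compare weighted sums of counts (e.g. #a > #b); the paper's main result is that transformers can capture the strictly larger semialgebraic class.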
Related papers
- On the Expressive Power of Floating-Point Transformers [12.42591017155152]
We investigate the representability of floating-point transformers that use floating-point parameters and floating-point operations.
We show that floating-point transformers can represent a class of non-permutation-equivariant functions even without positional encoding.
arXiv Detail & Related papers (2026-01-23T05:03:00Z)
- Probability Distributions Computed by Hard-Attention Transformers [53.17368795629463]
We show that making transformer language recognizers autoregressive can sometimes increase their expressivity.
Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
arXiv Detail & Related papers (2025-10-31T02:41:05Z)
- Transformers are Inherently Succinct [46.836122954309566]
We prove that transformers are highly expressive in that they can represent formal languages substantially more succinctly than standard representations of formal languages.
As a by-product of this expressivity, we show that verifying properties of transformers is provably intractable.
arXiv Detail & Related papers (2025-10-22T07:25:54Z)
- PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible, data-dependent position encoding scheme based on accumulated products of Householder transformations.
We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z)
- Concise One-Layer Transformers Can Do Function Evaluation (Sometimes) [1.157192696857674]
This paper contributes to the study of the expressive capacity of transformers.
We focus on their ability to perform the fundamental computational task of evaluating an arbitrary function from $[n]$ to $[n]$ at a given argument.
arXiv Detail & Related papers (2025-03-28T01:40:23Z)
- Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size [6.9060054915724]
Transformers are a type of neural network that have demonstrated remarkable performance across various domains.
In this study, we investigate the ability of transformers to approximate column-symmetric polynomials.
arXiv Detail & Related papers (2025-02-17T05:56:11Z)
- Extracting Finite State Machines from Transformers [0.3069335774032178]
We investigate the trainability of transformers trained on regular languages from a mechanistic interpretability perspective.
We empirically find tighter lower bounds on the trainability of transformers when a finite number of symbols determines the state.
Our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation.
arXiv Detail & Related papers (2024-10-08T13:43:50Z)
- Transformers are Efficient Compilers, Provably [11.459397066286822]
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks.
In this paper, we take the first steps towards a formal investigation of using transformers as compilers from an expressive power perspective.
We introduce a representative programming language, Mini-Husky, which encapsulates key features of modern C-like languages.
arXiv Detail & Related papers (2024-10-07T20:31:13Z)
- Length Generalization of Causal Transformers without Position Encoding [59.802708262402824]
Generalizing to longer sentences is important for recent Transformer-based language models.
We study the length generalization property of Transformers without position encodings.
We find that although NoPE can extend to sequences longer than the commonly used explicit position encodings, it still has a limited context length.
arXiv Detail & Related papers (2024-04-18T14:38:32Z)
- The Expressive Power of Transformers with Chain of Thought [29.839710738657203]
In practice, transformers can be improved by allowing them to use a "chain of thought" or "scratchpad"
We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation.
Our results also imply that linear steps keep transformer decoders within context-sensitive languages.
arXiv Detail & Related papers (2023-10-11T22:35:18Z)
- Attention Is Not All You Need Anymore [3.9693969407364427]
We propose a family of drop-in replacements for the self-attention mechanism in the Transformer.
Experimental results show that replacing the self-attention mechanism with the SHE evidently improves the performance of the Transformer.
The proposed Extractors have the potential to run faster than the self-attention mechanism.
arXiv Detail & Related papers (2023-08-15T09:24:38Z)
- The Impact of Positional Encoding on Length Generalization in Transformers [50.48278691801413]
We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches.
Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
arXiv Detail & Related papers (2023-05-31T00:29:55Z)
- Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales with shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
- CoPE: Conditional image generation using Polynomial Expansions [50.67390290190874]
We introduce a general framework, called CoPE, that enables an expansion of two input variables and captures their auto- and cross-correlations.
CoPE is evaluated in five tasks (class-generation, inverse problems, edges-to-image translation, image-to-image translation, attribute-guided generation) involving eight datasets.
The thorough evaluation suggests that CoPE can be useful for tackling diverse conditional generation tasks.
arXiv Detail & Related papers (2021-04-11T19:02:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.